Please do not write your name into this document.

In the open-ended Exercises 2, 5, and 7, more extensive, carefully crafted, polished, insightful, and well motivated and explained answers will receive higher marks. Also see the assessment criteria on the course website https://lse-my472.github.io/

suppressMessages(library(tidyverse))
suppressMessages(library(readr))
suppressMessages(library(dbplyr))
suppressMessages(library(lubridate))
suppressMessages(library(ggthemes))
suppressMessages(library(colorspace))
suppressMessages(library(ggh4x))
suppressMessages(library(scales))
library(ggrepel)
library(httr)
suppressMessages(library(jsonlite))
suppressMessages(library(data.table))
library(DBI)
library(RSQLite)
library(XML)
suppressMessages(library(rvest))
suppressMessages(library(quanteda))
suppressMessages(library(quanteda.textplots))
suppressMessages(library(quanteda.textstats))
suppressMessages(library(syuzhet)) # for sentiment analysis (Exercise 6)
suppressMessages(library(devtools))
suppressMessages(library(hms))
suppressMessages(library(extrafont))
suppressMessages(library(extrafontdb))
suppressMessages(library(chron))
devtools::install_github("ricardo-bion/ggradar", 
                          dependencies = TRUE)
library(ggradar)
library(RColorBrewer)

Exercise 1 (4 points)

Using the file posts.csv in the data folder of this repo (the sample of 10,000 public Facebook posts by members of the US congress from 2017), solve the following with dplyr:

posts <- read_csv("data/posts.csv")

# Remove posts with 0 likes, create clr column.
posts <- posts %>%
  filter(likes_count > 0) %>%
  mutate(clr = comments_count/likes_count, month = as.numeric(format(date, "%m")))

# Isolate even months, group by screen_name, create normaliser, isolate data to be joined, make distinct.
even_posts <- posts %>% 
  filter(month %% 2 == 0) %>%
  group_by(screen_name) %>% 
  mutate(normaliser_based_on_even_months = max(clr) - min(clr)) %>%
  select(screen_name, normaliser_based_on_even_months) %>%
  distinct()

# Join normaliser to  posts data.
posts <- posts %>%
  left_join(even_posts, by = c("screen_name" = "screen_name")) 

# Set all `normaliser_based_on_even_months` that have a value of zero to NA or delete them
posts <- posts %>% 
  filter(normaliser_based_on_even_months > 0)

# create a column `normalised_clr` which stores the `clr` of all posts from the original data frame divided by the `normaliser_based_on_even_months` of the associated screen name. 
# Keep only those rows with `normalised_clr` > 0
posts <- posts %>% 
  mutate(normalised_clr = clr/normaliser_based_on_even_months) %>%
  filter(normalised_clr > 0)

# Arrange the data frame according to `normalised_clr` in ascending order
posts <- posts %>%
  arrange(normalised_clr)

# Print out only `screen_name` and `normalised_clr` for the first 10 rows, i.e. the posts with the 10 lowest `normalised_clr`
posts %>% 
  select(screen_name, normalised_clr) %>%
  head(10)
## # A tibble: 10 × 2
##    screen_name                   normalised_clr
##    <chr>                                  <dbl>
##  1 CongresswomanSheilaJacksonLee        0.00140
##  2 CongresswomanSheilaJacksonLee        0.00214
##  3 CongresswomanSheilaJacksonLee        0.00218
##  4 SenDuckworth                         0.00232
##  5 CongresswomanSheilaJacksonLee        0.00277
##  6 CongresswomanSheilaJacksonLee        0.00286
##  7 CongresswomanSheilaJacksonLee        0.00295
##  8 SenDuckworth                         0.00343
##  9 RepMullin                            0.00343
## 10 CongresswomanSheilaJacksonLee        0.00357

Hint: Any approach that yields the correct output here will receive full points, whether it uses multiple steps, merges, etc.

Exercise 2 (15 points)

After the dplyr warm-up in the previous exercise, the next task will be to apply your knowledge of dplyr and other packages to an interesting real world-example with more extensive tabular data. Since 2021, there has been a much discussed rise in inflation in many economies. Often only a single number for inflation is computed and reported. It typically is the price change of an average basket of goods. “Basket” here just describes a set of goods with weights in which an average consumer approximately buys them. With the weights and prices of the individual goods, a price for the basket can be computed. The number for inflation is then the change of the basket’s price over time.

There are many potential limitations of such an aggregate inflation figure. For example, businesses might be most interested in price changes only in their sectors. Furthermore, when prices of individual goods change in different ways, consumers will likely change the weights in which they buy goods. Hence, those aggregate inflation figures which keep the basket weights fixed for some time before reweighting, begin to deviate from the proportions in which people actually buy goods on average and become inaccurate. Also, while some goods see strong rises in prices, others do not. As a result, those consumers who buy goods in proportions far from the average consumer can face very different inflation figures. Taking all this into account, it can therefore be very interesting to analyse price changes of individual goods, smaller bundles of goods, etc. over time rather than only one aggregate number. Such findings can e.g. be helpful for researchers, businesses, consumers, or policy makers.

Since some years, the UK Office of National Statistics publishes detailed price data of goods which you can find here. The website also contains extensive documentation. Download the csv file for “Price quotes, October 2022 edition of this dataset” (the correct file should be around 13MB), and the same for October 2021 and October 2020 (this e.g. allows to compute price changes from year to year, but you can also download additional/different months if helpful for your analysis). Explore the data, reshape and process it, and think of ways in which you can analyse and visualise it with packages such as ggplot2 or plotly. You could e.g. look at prices of some goods vs. services, energy, how prices moved during different stages of the pandemic, etc. Thus, the answer can analyse prices changes for different goods and/or bundles of goods over time, report summary figures, and contain different visualisations. Also describe your analysis through written text via markdown. Did you find/learn anything interesting?

Note: As the downloaded data can quickly become too large for GitHub, you can store them outside your repository and load them from there along the lines of read_csv(some/path/to/somefile.csv).

# October 2022 and a year before
oct_22 <- read_csv("/Users/christycoulson/Downloads/upload-pricequotes202210.csv")
oct_21 <- read_csv("/Users/christycoulson/Downloads/upload-pricequotes202110.csv")
# Most recent data and the year before
nov_22 <- read_csv("/Users/christycoulson/Downloads/upload-pricequotes202211.csv")
nov_21 <- read_csv("/Users/christycoulson/Downloads/upload-pricequotes202111.csv")
# 6 months ago
apr_22 <- read_csv("/Users/christycoulson/Downloads/upload-pricequotes202204.csv")
# 18 months ago
apr_21 <- read_csv("/Users/christycoulson/Downloads/upload-pricequotes202104.csv")
# 2 years ago
oct_20 <- read_csv("/Users/christycoulson/Downloads/upload-pricequotes202010.csv")
# 3 years ago
oct_19 <- read_csv("/Users/christycoulson/Downloads/upload-pricequotes201910.csv")
# month before lockdown and pandemic began
feb_20 <- read_csv("/Users/christycoulson/Downloads/upload-pricequotes202002.csv")
# month after lockdown and pandemic began
apr_20 <- read_csv("/Users/christycoulson/Downloads/upload-202004pricequotes.csv")

data <- rbind(nov_22,oct_22,apr_22,nov_21,oct_21,apr_21,oct_20,apr_20,feb_20, oct_19)

data <- data %>%
  mutate(QUOTE_DATE = ym(QUOTE_DATE)) %>%
  select(-VALIDITY, -SHOP_CODE, -BASE_VALIDITY, -STRATUM_CELL)

# Select 4 basic products 
# butter (211305), milk (211710), bread (210102), spaghetti/pasta (210204)

# Select 4 luxury products
# NECKLACE FASHION APX 16/20 INS (520140), HOTEL 1 NIGHT PRICE (640406), BOTTLE OF CHAMPAGNE 75 CL (310423), PERFUME/FRAGRANCE MEN OR WOMEN (520234)

bas_lux_data <- data %>%
  filter(ITEM_ID %in% c("211305","211710","210102","210204","520140","640406","310423","520234")) %>% 
  mutate(item_type = case_when(ITEM_ID %in% c("211305","211710","210102","210204") ~ "Basic",
                               ITEM_ID %in% c("520140","640406","310423","520234") ~ "Luxury")) 

# Summarise Data for Averages and standard errors
bas_lux_data_summarised <- bas_lux_data %>%
  group_by(ITEM_ID, ITEM_DESC, QUOTE_DATE, item_type) %>%
  summarise(avg_price = mean(PRICE),
            sd_price = sd(PRICE),
            sd_price_relative = sd(PRICE_RELATIVE), 
            avg_price_relative = mean(PRICE_RELATIVE),
            cinf_low = avg_price_relative - 1.96*sd_price_relative,
            cinf_high = avg_price_relative + 1.96*sd_price_relative) 

This analysis will seek to understand the relative impact of inflation on wealth inequality in the U.K. from October 2019 to November 2022. I will use ‘basic’ and ‘luxury’ item groupings in order to disentangle said impact. The logic here is that those who are poorer cannot afford to spend on luxury items as they spend a large portion of their income on basic, essential products whereas those who are more wealthy can, and will, spend money on luxury items. Thus, a higher increase in inflation for basic items compared to luxury items will disproportionally affect the disposable income, and thus quality of life, of poorer families compared to more wealthy ones. I have chosen to include 4 products in each basket, which are specified below:

Basic:

Luxury:

These particular items were chosen due to the relatively large sample size, leading to less standard error in metrics. Also, the discourse around basics vs. luxury often mentions the aforementioned items to illustrate purchasing habits in the sense that champagne is often associated with the more wealthy, and items such as bread, butter, milk and dry pasta have been colloquially associated with the working class in Britain.

This analysis will also look at another interesting facet of wealth inequality in the U.K., regional inequality. This has been a prominent topic in the U.K. over the last couple of years, and there has been much talk around wealth inequality predicated on regional inequalities being deepened as a result of differing levels of access to investment and COVID relief funding during the last few years. I seek to better understand this context, and the intersectionality between relative basic/luxury inflation and regional inflation to challenge or affirm the notion that wealth inequality is deepening across multiple facets of British society.

In terms of sample selection for dates, I have used both an event-based and interval logic. The first COVID-19 lockdown came into effect in March of 2020, drastically changing economics in the U.K., and thus I have included October 2019 as our starting point, the last YoY date for our October 2022 data prior to COVID. I’ve also included February 2020, as this was the last month before the COVID-19 pandemic really ‘hit’ the U.K., and a month prior to the nation’s first lockdown. I have chosen 6 monthly and 12 monthly intervals relative to our base October 2022 data. This has been done to standardise time distance and thus combat over/underfitting in some modelling contexts. I have also included November 2022 data as it is the most recent data available.

First, let’s look at how the price of basic and luxury items have changed since prior to the first COVID-19 lockdown, from October 2019 until November 2022.

bas_lux_data_summarised %>%
  group_by(item_type, QUOTE_DATE) %>%
  summarise(avg_item_type_price = mean(avg_price)) %>%
  ggplot(aes(x = QUOTE_DATE, y = avg_item_type_price, colour = item_type)) +
  geom_smooth() +
  geom_jitter() +
  guides(color = guide_legend(title = "Item Type")) +
  ggh4x::facet_grid2(~item_type, scales = "free_y", independent = "y") +
  theme_economist() +
  labs(title = "Average Price of Basic and Luxury Goods from October 2019 until November 2022",
       subtitle = "Basic Products have, on average, experienced price inflation to a much greater extent than luxury products") +
  xlab("Date") +
  ylab("Average Price per Item Type") +
  theme(legend.position = "bottom", 
        plot.title = element_text(size = 11, hjust = 0), 
        plot.subtitle = element_text(size = 9, hjust = 0),
        strip.text.x = element_blank())

Basic vs. Luxury

First, let’s begin with a broad analysis of price inflation for both Basic and Luxury product groups. Despite being only a few data points, the data contributing to these descriptive statistics is summarised from 7444 observations of pricing for basic items and 7629 for luxury items. I chose to use the price variable here and aggregate up instead of the price relative variable because the price variable relative, according to the documentation, measures relative to the January price of that year. Thus, months later in the year are likely to have higher values due to the cumulative nature of inflation. The Line above uses soothed conditional means, and the confidence intervals are based on standard error.

As seen above, the average price of basic items has seen a consistent positive trend since October 2020. In contrast, luxury items have experienced s fairly neutral overall price change from October 2019 to November 2022 which has been characterised by both rises and falls in overall price. The outcome of this inflationary pressure is that, assuming that both a working class and wealthier family consume basic products at the same rate, the inflationary pressure from basic products disproportionality affects the poorer familys disposable income as the proportion of their wealth allocated to said product increases at a greater rate. In contrast, richer families, who are less affected by rises in basic product prices (in terms of the proportion of their overall wealth) can still purchase luxury products at similar prices as before. Thus, in Britain, poorer families will have less disposable income in terms of the proportion of their overall wealth in ratio to rich families as a result of consistent high inflation for basic goods.

Another thing to note about the above graphic is that both basic and luxury products experienced price decreases at the beginning of COVID-19. This continued until the discontinuing of the ‘Eat Out to Help Out’ scheme, which offered discounts on meals in the U.K., finished. This was followed by an uptick in inflation at around September/October 2020, where the infamous ‘rule of six’ rule was put in place and, interestingly, a second lockdown was enforced. It seems then, that the relationship between enforcing strict COVID-19 restrictions and price inflation is non-linear for both. We should look at the distributions of observed prices relative to January for their respective year to see how distribution shapes are different.

bas_lux_data %>%
  ggplot(aes(fill = item_type)) +
  geom_density(aes(x = PRICE_RELATIVE), alpha = 0.5) +
  guides(fill = guide_legend(title = "Item Type")) +
  xlim(0,2.1) +
  labs(title = "Price Change Relative to January of Respective Year for Basic and Luxury Goods Since October 2019", 
       subtitle = "Basic goods have increased in value to a greater proportion than luxury goods") +
  xlab("Price Change Relative to January of Respective Year") +
  ylab("Density") +
  theme_economist() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "bottom", 
        plot.title = element_text(size = 10, hjust = 0), 
        plot.subtitle = element_text(size = 8, hjust = 0))

Despite the relative price change being compared to January of the given year, and the quote year being different for differing observations (thus values being measured relative to different values per year), they occur in comparable year/month combinations and thus we can compare densities, as if they were the same we would expect symmetry. For basic goods in total, the average relative price (inflation vs. January) is 0.999, meaning that prices, on average, stayed the same. This is lower than expected due to price drops prior to 2021. For luxury goods, the average relative price is 0.864. This indicates that, on average, basic goods rose in price more than luxury goods. Let’s look at prices since the beginning of 2021 to see if the pattern is different.

bas_lux_data_summarised %>%
  filter(year(QUOTE_DATE) > 2020) %>%
  group_by(item_type) %>%
  summarise(avg_price_relative = mean(avg_price_relative))
## # A tibble: 2 × 2
##   item_type avg_price_relative
##   <chr>                  <dbl>
## 1 Basic                  1.05 
## 2 Luxury                 0.891
# Basic = 1.055, Luxury = 0.891

bas_lux_data %>%
  filter(year(QUOTE_DATE) > 2020) %>%
  ggplot(aes(fill = item_type)) +
  geom_density(aes(x = PRICE_RELATIVE), alpha = 0.5) +
  guides(fill = guide_legend(title = "Item Type")) +
  xlim(0, 2.5) +
  labs(title = "Price Change Relative to January of Respective Year for Basic and Luxury Goods Since January 2021", 
       subtitle = "Basic goods have increased in value to a greater proportion than luxury goods") +
  xlab("Price Change Relative to January of Respective Year") +
  ylab("Density") +
  theme_economist() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "bottom", 
        plot.title = element_text(size = 10, hjust = 0), 
        plot.subtitle = element_text(size = 9, hjust = 0))

There are two main insights we can derive from the above graphic. The first concerns the differing skews for basic and luxury goods.Luxury goods have an average relative price decrease of around 10.9%. In contrast, Basic goods rose in price by 5.5%, a difference of 16.4% (rounded). This results in a negative skew for luxury products (price decrease), and a positive skew for positive products (price increase). This difference can be viewed graphically above, as a greater density of basic average relative price sits between 1.0 and 1.5 (price stability to 50% price increase), whereas the density is greater for luxury products between 0.5 and 1.0 (50% price decrease to price stability). Interestingly, there is a large spike for luxury products around 100-90% average relative price decrease, relative to basic goods.

So, we can infer from the above graph that inflationary pressure since 2021 has likely affected poorer families disposable income and quality of life to a greater proportional extent that more wealthy families. However, there may be additional explanatory variables, such as region, which help explain the puzzle of wealth inequality distribution in Britain. Let’s explore this below.

In order to better understand regional inequality, I want to select one region that is regarded as wealthy, and one which is regarded as less wealthy. According to the Equality Trust, the South East and London are the wealthiest regions, and the North East and North West are the least wealthy. This analysis can be found here: https://equalitytrust.org.uk/scale-economic-inequality-uk#:~:text=Wealth%20is%20also%20unevenly%20spread,North%20West%20(%C2%A3165%2C200).&text=The%20UK’s%20wealth%20distribution%20is,to%20the%20other%20OECD%20countries. Thus, I will isolate those four regions, combine South East and London into ‘South East’ and combine North West and North (there is no North East region in the data) into ‘North’ These are crude definitions, but facilitate this analysis. Then, I will make a comparison on price inflation for these two differing regions to test the hypothesis that regional inequality exacerbates existing inequalities between rich and poor, which are evidenced in the two graphs above.

# Create dataset with just North and South East, create variable determining Region
data_N_SE <- data %>%
  filter(REGION %in% c(2,3,9,10) & ITEM_ID %in% c("211305","211710","210102","210204","520140","640406","310423","520234")) %>%
  mutate(Region_2 = case_when(REGION %in% c(2,3) ~ "South East", 
                              REGION %in% c(9,10) ~ "North"),
         item_type = case_when(ITEM_ID %in% c("211305","211710","210102","210204") ~ "Basic",
                               ITEM_ID %in% c("520140","640406","310423","520234") ~ "Luxury")) 

# Summarise dataset to create variables of interests.
N_SE_summarised <- data_N_SE %>%
  group_by(ITEM_ID, ITEM_DESC, QUOTE_DATE, Region_2, item_type) %>%
  summarise(avg_price = mean(PRICE),
            sd_price = sd(PRICE),
            sd_price_relative = sd(PRICE_RELATIVE), 
            avg_price_relative = mean(PRICE_RELATIVE),
            cinf_low = avg_price_relative - 1.96*sd_price_relative,
            cinf_high = avg_price_relative + 1.96*sd_price_relative)
N_SE_summarised %>%
  filter(year(QUOTE_DATE) > 2020) %>%
  group_by(item_type, Region_2, QUOTE_DATE) %>%
  summarise(avg_item_type_price_relative = mean(avg_price_relative)) %>%
  ggplot(aes(x = QUOTE_DATE, y = avg_item_type_price_relative, colour = item_type)) +
  geom_smooth(se = FALSE) +
  geom_jitter() +
  guides(color = guide_legend(title = "Item Type")) +
  ggh4x::facet_grid2(~Region_2) +
  theme_economist() +
  labs(title = "Average Relative Price Change for 2021 & 2022 for North and South East Regions",
       subtitle = "The north has consistently experienced a greater disparity in price inflation between basic and luxury goods") +
  xlab("Date") +
  ylab("Relative Price Change") +
  theme(legend.position = "bottom", 
        plot.title = element_text(size = 11, hjust = 0), 
        plot.subtitle = element_text(size = 9, hjust = 0),
        plot.caption.position = "plot")

Regional Inflation and Wealth Inequality

The Line above uses soothed conditional means. In the North of the country, the prices for basic goods have never reduced relative to January by more than 2%. In contrast, the South East experienced price deflation for basic goods until Mid-2022, with a peak in deflation at -7.5%. In addition, the gap in inflationary pressure between basic and luxury good consistently exceeded 10%. With a peak difference of around 38% (basic goods inflating 38% more than luxury goods, relative to the January of 2022), inflationary differences were greater between basic and luxury goods for the North than the South East. This pressure likely exacerbates the pre-existing regional wealth inequality between those in the North of the country, and those in the South East. It seems, then, that inflationary pressure has disproportionately affected poorer households when intersecting regional inequality with basic/luxury proportional spending habits of the wealthy vs. less-so wealthy. Let’s look at densities to understand whether this is reflected in the distributions of relative price changes.

data_N_SE %>%
  filter(year(QUOTE_DATE) > 2020) %>%
  ggplot(aes(fill = item_type)) +
  geom_density(aes(x = PRICE_RELATIVE), alpha = 0.5) +
  guides(fill = guide_legend(title = "Item Type")) +
  ggh4x::facet_grid2(~Region_2) +
  xlim(0, 2.5) +
  labs(title = "Price Change Relative to January of Respective Year for Basic and Luxury Goods Since January 2021", 
       subtitle = "Basic goods have increased in value to a greater proportion than luxury goods") +
  xlab("Price Change Relative to January of Respective Year") +
  ylab("Density") +
  theme_economist() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "bottom", 
        plot.title = element_text(size = 10, hjust = 0), 
        plot.subtitle = element_text(size = 9, hjust = 0))

Somewhat surprisingly, the density patterns seem to be similar for the North and South East. The North has a slightly greater positive inflation skew for basic goods, whereas the South East has a fatter positive tail for Luxury goods. This difference likely represents the relative wealth of the two regions. However, overall, it seems that both the North and South East are distributed the same, likely as a result of being subject to the same macro-economic patterns that Britain has experienced over the last couple of years - namely Brexit, the Russian invasion of Ukraine, COVID-19, and lack of coherent domestic economic policy in Britain. To conclude, we already know that wealth inequality exists, and has existed in the U.K. for centuries. This is evidenced here: https://equalitytrust.org.uk/scale-economic-inequality-uk. Trends in research into wealth inequality in the last few decades have focused on intersectionality with gender, class, ethnicity and region. This research is evidenced in papers such as Warren’s 2008 analysis, found here: https://doi.org/10.1080/13545700500508502. The purpose of my analysis was to try to better understand whether inflation has worsened wealth inequality in the past few years. Inflation has affected basic, essential goods to a greater extent than it has luxury goods. As poorer households already spend a greater portion of their income on these basic goods, this difference in inflation means that inflation affects their disposable income, and thus quality of life, to a greater extent than wealthier households. This is especially true from 2021 onwards, where COVID-19 began to be placated by vaccine roll-outs and other worries began to affect prices in Britain, namely economic policy, energy crises, Brexit, and the war in Ukraine. This intersects with regional inequality. Poorer households in the North experienced greater relative inflation between basic and luxury goods than those in the wealthier South East. This deepens the inequality divide. This divide is again exacerbated by differences in government investment between the North and South East, which have been consistently reported on during and after COVID. An example of such reporting is here: https://www.theguardian.com/society/2022/jan/16/englands-north-south-divide-is-deepening-says-new-report. I anticipate such a growth in the gap between rich and poor will likely have broad social, political and cultural implications as we move through the 2020s. This would be an interesting avenue for further study.

Exercise 3 (10 points)

Use ggplot2 to try to replicate the following plot on parameter numbers in AI systems from Our World in Data. Start with the basic features of the plot and then try to get as close to the original as possible with ggplot2. There is no need to animate the plot, just use the latest static plot and replicate that. As the data might change over time, feel free to download the image version from the website that you are going to replicate. Then you can add this image into the repo and afterwards bind it into this markdown e.g. with ![](imagename.png) as a reference below your own plot.

Note: When you click on DOWNLOAD below the plot on the website, you can download the latest image as well as all data.

ai_data <- read_csv("data/artificial-intelligence-parameter-count.csv")
# Y axis break values
y_axis_breaks <- c(100,1000,10000,100000,1000000,10000000,100000000,1000000000,10000000000,100000000000,1000000000000)

# Final graph
ai_data %>%
  filter(!Domain %in% c("VIsion","Text-to-Video")) %>%
  ggplot(aes(x = Day, y = Parameters)) +
  geom_point(aes(fill = Domain), pch = 21) +
  geom_text_repel(aes(fill = Domain,
                      colour = Domain,
                      label = ifelse( Entity == "NPLM" | Entity == "Hiero" | Entity == "IBM-5" | Entity == "NetTalk" | Entity == "BiLSTM for Speech" | Entity == "Theseus" |  Entity == "Neocognitron" | Entity == "DIABETES" | Entity == "System 11" | Entity == "Self Organizing System" | Entity == "Pandemonium (morse)" | Entity == "Samuel Neural Checkers" | Entity == "ASE+ACE" | Entity == "Kohonen network" | Entity == "Fuzzy NN" | Entity == "SACHS" | Entity == "Peephole LSTM" | Entity == "LeNet-5" | Entity == "GroupLens" | Entity == "DEM" | Entity == "Deep Belief Nets" | Entity == "Feedforward NN"| Entity == "Dropout(TIMIT)" | Entity == "Word2Vec (large)" | Entity == "GPT-2" | Entity == "MoE" | Entity == "DLRM-2020" | Entity == "GShard(600B)" | Entity == "GNMT" | Entity == "Mitosis", Entity, "")),
            check_overlap = TRUE,
            max.overlaps = 14,
            max.time = 3,
            direction = "y",
            inherit.aes = TRUE,
            position = position_dodge(),
            show.legend = FALSE,
            size = 2.5) +
  guides(fill = guide_legend(title = "Task Domain", 
                             override.aes = list(size = 2, shape = 22),
                               title.theme = element_text(face = "bold", size = 9))) +
  scale_y_log10(labels = scales::label_number_si(), breaks = y_axis_breaks) +
  scale_x_date(date_labels = "%b %d, %Y", breaks = as.Date(c("1950-07-02","1978-12-27", "1992-09-04", "2006-05-14"))) +
  scale_fill_manual(values = c("Drawing" = "darkorchid4","Driving" = "chocolate3","Games" = "seagreen3","Language" = "royalblue3","Multimodal" =  "firebrick4","Other" = "burlywood3","Reading" = "blue4","Recommendation" = "hotpink3","Robotics" = "darkgreen","Speech" = "sienna4","Video" = "turquoise2","Vision" = "seagreen4")) +
    scale_colour_manual(values = c("Drawing" = "darkorchid4","Driving" = "chocolate3","Games" = "seagreen3","Language" = "royalblue3","Multimodal" =  "firebrick4","Other" = "burlywood3","Reading" = "blue4","Recommendation" = "hotpink3","Robotics" = "darkgreen","Speech" = "sienna4","Video" = "turquoise2","Vision" = "seagreen4")) +
  labs(title = "Number of parameters in notable artificial intelligence systems",
       subtitle = "Parameters are variables in an AI system whose values are adjusted during training to establish how input data gets\n transformed into the desired output; for example, the connection weights in an artificial neural network.", 
       caption  = "Source: Sevilla et al. (2022)                                                    OurWorldinData.org/artificial-intelligence CC BY
Note: Parameters are estimated based on published results in the Al literature and come with some uncertaintv. The authors expect the estimate\nto be correct within a factor of 10.") +
  xlab("Publication date") +
  ylab("Parameters") +
  theme_minimal() +
  theme(plot.title = element_text(size = 15, colour = "grey30", family = "serif"),
        plot.subtitle = element_text(size = 9, colour = "grey30", family = "serif"),
        axis.text = element_text(size = 8),
        axis.title.x = element_text(size = 8),
        axis.title.y = element_text(size = 8),
        legend.text = element_text(size = 9),
        legend.justification = "top",
        legend.box.margin = margin(0,0.5,0.5,0.5),
        legend.spacing.y = unit(0.01, "cm"),
        legend.spacing.x = unit(0, "cm"),
        plot.caption = element_text(colour = "grey50", size = 7.5,hjust = 0),
        plot.tag.position = c(.9,.3), 
        plot.tag = element_text(size = 10),
        panel.grid.major = element_line(linetype = "dashed"))

Exercise 4 (7 points)

When analysing textual data, the approaches we discussed such as counting tokens and building dictionaries, are already sufficient to answer many research questions. When the goal is to obtain an overview of information contained in unstructured textual data, however, some additional approaches such as “named-entity linking” can be very helpful to know. Using a language model and a knowledge base, these tools can find words which describe people, institutions, places, etc., and importantly also obtain their ID in databases such as Wikidata. This allows to automatically extract an approximate list of e.g. institution who have a Wikidata entry and are mentioned in the text, which is very convenient for obtaining additional information via Wikidata/Wikipedia afterwards.

An open-source option is OpenTapioca which you can try in your browser via this link. It is e.g. implemented in Python as a plugin to the library spaCy. The task of this exercise is to build an own simple function from scratch in R that allows to use OpenTapioca. A good first step is usually to check out the source code of the original (here Python) function which you can find

here. There is no need to know Python for this exercise, just looking at “url” and the function “make_request” in the code already indicates that it is actually just querying an API. Thus, using httr it is possible to build a simple function also in R. I have started this function, your task is to complete it and then demonstrate that it works with the exemplary sentences added below.

The function should return a tibble as output named top_wikidata_links with one row being one entity that was found in the input text and four columns being “id” (the Wikidata id), “label” (the name), “description” (a short description returned by the API via “$desc” – note that for some entities this description can be null, in that case just set the description in the output tibble to NA), and “score” (higher generally indicates a better fit to the text). You can also add further columns to your returned tibble, but the first four need to be these. The tibble should have as many rows as entities that were found in the text and had a score larger than minimum_score, which is a cutoff that the user can choose (all other matches returned by the API call with lower scores should not be included in the tibble).

Note that this open-source API does not work perfectly and misses some entities in texts, but can already be very helpful when exploring textual data. With very long input texts, it could also be tried whether breaking them down into smaller parts (e.g. sentences or paragraphs) before inputting them into the API increases accuracy or not.

Hint: Note that in the first test below, the two detected entities will be “Karl Popper” and “LSE”. “Karl Popper” only returns a single match, however the “LSE” part of the text will return also matches other than the London School of Economics and Political Science, e.g. the London Stock Exchange. For each entity that the API returns, the tibble returned by the function needs to include a row as long as its associated score is larger than the minimum score.

get_wikidata_links <- function(input_text, minimum_score) {
  
  #
  # Function which takes a character vector of length 1 as input (i.e. all text
  # needs to be combined into a single character) as well as a minimum certainty
  # score, and returns a tibble with key information and links to Wikidata
  #
  # Input
  #  - input_text: Text input (character)
  #  - minimum_score: Minimum score that every returned entity needs to have
  #                   (numeric)
  #
  # Output
  #  - top_wikidata_links: Table with the first four columns being 'id', 'label',
  #               'description', 'score' (tibble)
  #
  
  top_wikidata_links <- data.frame(id = character(),
  label = character(),
  desc = character(),
  score = numeric())

  base_url <- "https://opentapioca.org/api/annotate" 

  Text <- GET(base_url, query = list(query = input_text)) %>%
    content("text") %>% 
    fromJSON(flatten = FALSE)

  Data <- Text$annotations$tags
   
  for (i in 1:length(Data)) {
      
    d_row <- Data[[i]] %>%
      select(id, label, desc, score)
    
    top_wikidata_links <- rbind(top_wikidata_links, d_row)
    
    top_wikidata_links <- as_tibble(top_wikidata_links) %>%
      filter(score > minimum_score)
  
  }
  
  return(top_wikidata_links)
}

Next demonstrate that your function works by running the following two tests:

# Test 1
text_example_1 <- "Karl Popper worked at the LSE."
get_wikidata_links(text_example_1, -0.5)
## # A tibble: 3 × 4
##   id      label                                            desc            score
##   <chr>   <chr>                                            <chr>           <dbl>
## 1 Q81244  Karl Popper                                      Austrian-Brit…  2.46 
## 2 Q174570 London School of Economics and Political Science university in…  1.47 
## 3 Q171240 London Stock Exchange                            stock exchang… -0.412
# 
# Hint: The output should be a tibble similar to the one outlined below
#
# | id | label | description | score |
# | "Q81244" | "Karl Popper" | "Austrian-British philosopher of science" | 2.4568285 |
# | "Q174570" | "London School of Economics and Political Science" | "university in Westminster, UK" | "1.4685043" |
# | "Q171240" | "London Stock Exchange" | "stock exchange in the City of London" | "-0.4124461" |
Sys.sleep(20)

# Test 2
text_example_2 <- "Claude Shannon studied at the University of Michigan and at MIT."
get_wikidata_links(text_example_2, 0)
## # A tibble: 3 × 4
##   id      label                                 desc                       score
##   <chr>   <chr>                                 <chr>                      <dbl>
## 1 Q92760  Claude Shannon                        American mathematician an… 1.96 
## 2 Q230492 University of Michigan                public research universit… 1.29 
## 3 Q49108  Massachusetts Institute of Technology research university in Ca… 0.902

Exercise 5 (25 points)

The New York Times (NYT) APIs offer a rare free opportunity to analyse news data since 1851 (the year of the first issue of the paper). The “Archive API” allows to download all article headlines and lead paragraphs/snippets for a given month (and through iteration potentially for a range of years). While this does not allow to obtain the full articles, it still allows to download a substantial amount of textual data and thereby do own text analysis.

The task of this exercise is to use the Archive API (not the Article Search API) to develop and present a coherent analysis of a topic of your choice. This could e.g. be the news coverage of a specific historical event somewhere in the last 130 years that you are interested in (in that case the downloaded archive data would be the month(s) of that event) or also a general topic over time. While the NYT Archive API is the primary source of the analysis, in a next step also use either Wikidata (e.g. accessible via these options such as an API or the Wikidata websites themselves), or alternatively scrape Wikipedia to obtain further information on topics relevant for the selected newspaper texts. You can get links between NYT texts and Wikipedia e.g. via the function developed on Exercise 4 (or some variation of that function) and/or just manually by reading the articles and deciding which information to look up on Wikidata/Wikipedia. Your answer can e.g. contain text analysis with quanteda, word clouds, visualisations with ggplot2 or plotly, analysis of tabular data with dplyr, or use of further R packages. Also motivate and describe your analysis through markdown texts.

As the downloaded data can again quickly become too large for GitHub, you can store them outside your repository and load them from there.

Hint 1: You might find the code examples from week 5 helpful for this exercise. For the Archive API, the last section in the file 02-nytimes-api.Rmd illustrates how you can transform its output into a data frame. Usually the main available text is contained in the headline, abstract, lead_paragraph, and/or snippet columns (depending on which month/year you download some of these can be empty or the same). Note that the headline information in the data frame is nested and can be accessed with df$headline$main.

Hint 2: Named entity linking with tools such as OpenTapioca gives identifiers for Wikidata such as e.g. “Q7099”. Not every Wikidata page also has an associated Wikipedia page, but those that have do contain links to the Wikipedia page. You can e.g. see this on Emmy Noether’s Wikidata page https://www.wikidata.org/wiki/Q7099 which contains links to her Wikipedia page in many languages in case you prefer to scrape data from Wikipedia rather than using Wikidata.

# Studying the news reported from the NYT on the Rwandan Genocide
# Genocide went from April 7 until 15 July 1994. 
# 20 year anniversary was April 2014. 

apikey <- "1AQ9XbHC4ukGUg9OfJpFgNGok4tAUix3"

# Genocide period April - July 2014.
rg_nyt_apr <- GET("https://api.nytimes.com/svc/archive/v1/1994/4.json?api-key=1AQ9XbHC4ukGUg9OfJpFgNGok4tAUix3")
rg_nyt_may <- GET("https://api.nytimes.com/svc/archive/v1/1994/5.json?api-key=1AQ9XbHC4ukGUg9OfJpFgNGok4tAUix3")
rg_nyt_jun <- GET("https://api.nytimes.com/svc/archive/v1/1994/6.json?api-key=1AQ9XbHC4ukGUg9OfJpFgNGok4tAUix3")
rg_nyt_jul <- GET("https://api.nytimes.com/svc/archive/v1/1994/7.json?api-key=1AQ9XbHC4ukGUg9OfJpFgNGok4tAUix3")
# 20 year anniversary
rg_nyt_20_years <- GET("https://api.nytimes.com/svc/archive/v1/2014/4.json?api-key=1AQ9XbHC4ukGUg9OfJpFgNGok4tAUix3")

# April 1994
rg_nyt_apr_json_as_text <- content(rg_nyt_apr, "text")
json_apr_1994 <- fromJSON(rg_nyt_apr_json_as_text)
df_apr_1994 <- json_apr_1994$response$docs %>% as_tibble() %>%
  mutate(headline_txt = headline$main) %>%
    select("abstract", "lead_paragraph", "headline_txt", "pub_date", "_id") 

# May 1994
rg_nyt_may_json_as_text <- content(rg_nyt_may, "text")
json_may_1994 <- fromJSON(rg_nyt_may_json_as_text)
df_may_1994 <- json_may_1994$response$docs %>% as_tibble() %>%
  mutate(headline_txt = headline$main) %>%
    select("abstract", "lead_paragraph", "headline_txt", "pub_date", "_id") 

# June 1994
rg_nyt_jun_json_as_text <- content(rg_nyt_jun, "text")
json_jun_1994 <- fromJSON(rg_nyt_jun_json_as_text)
df_jun_1994 <- json_jun_1994$response$docs %>% as_tibble() %>%
  mutate(headline_txt = headline$main) %>%
    select("abstract", "lead_paragraph", "headline_txt", "pub_date", "_id") 

# July 1994
rg_nyt_jul_json_as_text <- content(rg_nyt_jul, "text")
json_jul_1994 <- fromJSON(rg_nyt_jul_json_as_text)
df_jul_1994 <- json_jul_1994$response$docs %>% as_tibble() %>%
  mutate(headline_txt = headline$main) %>%
    select("abstract", "lead_paragraph", "headline_txt", "pub_date", "_id") 

# 20 Year Anniversary
rg_nyt_201404_json_as_text <- content(rg_nyt_20_years, "text")
json_jul_201404 <- fromJSON(rg_nyt_201404_json_as_text)
df_201404 <- json_jul_201404$response$docs %>% as_tibble() %>%
  mutate(headline_txt = headline$main) %>%
    select("abstract", "lead_paragraph", "headline_txt", "pub_date", "_id") 

rg_nyt <- rbind(df_apr_1994, df_may_1994, df_jun_1994, df_jul_1994, df_201404)

# Make pub_date date/time data type
rg_nyt <- rg_nyt %>%
  mutate(pub_date = as.Date(substr(pub_date, 1, 10), "%Y-%m-%d")) %>%
  filter(grepl("Rwanda", abstract))
# Now Wikipedia

rwandan_genocide_url <- "https://en.wikipedia.org/wiki/Rwandan_genocide"
rwandan_genocide_html <- read_html(rwandan_genocide_url)
rwandan_genocide_text_content <- rwandan_genocide_html %>% html_elements(css = "p") %>% html_text() %>% paste(collapse = " ")

rwandan_genocide_tables <- html_table(rwandan_genocide_html, fill = TRUE)
# split during genocide from after genocide
rg_nyt_1994 <- rg_nyt %>%
  filter(year(pub_date) < 2000) 

rg_nyt_2014 <- rg_nyt %>%
  filter(year(pub_date) > 2000) 

# corpus of abstract from 1994
rg_nyt_abs_corpus_1994 <- corpus(rg_nyt_1994$abstract)

# corpus of abstract from 2014
rg_nyt_abs_corpus_2014 <- corpus(rg_nyt_2014$abstract)

# dfm 1995
rg_nyt_abs_1994_dfm <- tokens(rg_nyt_abs_corpus_1994, remove_punct = TRUE) %>% 
  tokens_remove(stopwords("en")) %>% 
  tokens_wordstem() %>%  
  dfm() %>% 
  dfm_trim(min_termfreq = 2)

# dfm 2014
rg_nyt_abs_2014_dfm <- tokens(rg_nyt_abs_corpus_2014, remove_punct = TRUE) %>% 
  tokens_remove(stopwords("en")) %>% 
  tokens_wordstem() %>%  
  dfm() %>% 
  dfm_trim(min_termfreq = 2)

Exercise 5: Using the NYT Archive API to Examine the Rwandan Genocide.

This analysis will centre around the reporting on the 1994 Rwandan Genocide, which took place between 7 April 1994 and 15 July 1994 in the midst of the the Rwandan Civil War. The violence erupted after the assassination of Rwandan president, Juvénal Habyarimana. I have extracted data from April to July of 1994, which covers the genocide period, and reporting from April 2014, which marked the 20th year anniversary of the conflict. During this period, members of the Tutsi ethnic minority group, as well as some moderate Hutus and Twa, were killed by armed Hutu (ethnic majority) militias and their associates. It is estimated that between 500,000 and 662,000 Tutsis died during this period of around 100 days. In addition, an estimated 250,000 to 500,000 women were said to experience sexual violence during this time. I will allow an extract from Wikipedia to add some colour to the events leading to the beginning of the Rwandan Genocide. This can be found below:

html_elements(rwandan_genocide_html, css = "#mw-content-text > div:nth-child(1) > p:nth-child(12)") %>%
  html_text()  %>%
  str_replace_all("\n", "") %>%
  str_replace_all("\\[4]", "")
## [1] "In 1990, the Rwandan Patriotic Front (RPF), a rebel group composed mostly of Tutsi refugees, invaded northern Rwanda from their base in  Uganda, initiating the Rwandan Civil War. Over the course of the next three years, neither side was able to gain a decisive advantage. In an effort to bring the war to a peaceful end, the Rwandan government led by Hutu president, Juvénal Habyarimana signed the Arusha Accords with the RPF on 4 August 1993. The catalyst became Habyarimana's assassination on 6 April 1994, creating a power vacuum and ending peace accords. Genocidal killings began the following day when majority Hutu soldiers, police, and militia murdered key Tutsi and moderate Hutu military and political leaders."

The genocide was characterised by its brutality. Many victims were killed in their own locale, and both neighbours and family members collaborated with militias in the identification and murder of thousands of Tutsi, as well as some moderate Hutu and Twa. Despite this, there were no interventions by foreign powers until the UN’s non-armed presence. Due to this brutality, it is wise to begin with a frequency analysis of the types of words used in the NYT during the period to cover the event to see if the coverage captured such a characteristic.

# most frequent words 1994
names(topfeatures(rg_nyt_abs_1994_dfm, 100))
##   [1] "rwanda"    "nation"    "unite"     "rwandan"   "refuge"    "said"     
##   [7] "new"       "troop"     "forc"      "today"     "presid"    "countri"  
##  [13] "offici"    "kill"      "state"     "rebel"     "clinton"   "relief"   
##  [19] "french"    "govern"    "two"       "a1"        "1"         "peopl"    
##  [25] "report"    "war"       "say"       "last"      "zair"      "death"    
##  [31] "one"       "franc"     "camp"      "militari"  "african"   "tutsi"    
##  [37] "american"  "thousand"  "citi"      "plan"      "first"     "kigali"   
##  [43] "capit"     "border"    "year"      "day"       "help"      "world"    
##  [49] "now"       "million"   "time"      "week"      "hutu"      "mani"     
##  [55] "africa"    "peacekeep" "internat"  "send"      "major"     "page"     
##  [61] "cholera"   "civil"     "began"     "die"       "front"     "call"     
##  [67] "massacr"   "health"    "aid"       "leader"    "tri"       "administr"
##  [73] "burundi"   "member"    "minist"    "soldier"   "foreign"   "effort"   
##  [79] "armi"      "south"     "peac"      "three"     "secur"     "fight"    
##  [85] "worker"    "protect"   "month"     "group"     "end"       "back"     
##  [91] "power"     "neighbor"  "town"      "bosnia"    "patriot"   "ethnic"   
##  [97] "u.n"       "home"      "council"   "still"
# wordcloud 1994
textplot_wordcloud(rg_nyt_abs_1994_dfm, rotation=0.1, random_order = FALSE, min_size= 0.5, max_size=8, min_count = 10, max_words=200, color = "red")

Interestingly, nationalistic tropes such as ‘nation,’ ‘unite,’ ‘rwandan,’ are the most frequently words used, despite being a war within a single country rather than between two nation states. This likely alludes to the internal Rwandan discourse during the genocide, whereby the genocide occurred off the back of a lengthy civil war in which the dominant mobilising Hutu discourse of the conflict concerned the characterisation of the Tutsi minority as a foreign, homogeneous threat to Hutu interests. It’s interesting that this is captured by a western media outlet such as the NYT.

The second tier of word frequencies seems to revolve around a common theme, acts of political violence. Words such as ‘kill,’ ‘death,’ ‘war,’ ‘rebel,’ ‘and ’massacre’ are common throughout. The reported brutality captured the western world in 1994, and this descriptive language likely contributed to the worldwide awareness.

Despite this awareness, however, the international community has been criticised for its lack of response to the genocide. The third tier of word frequency contains themes of foreign powers, colonialism, intervention and aid. The word ‘French’ is particularly interesting because from 1819 Rwanda was a German colony, and was administered by Belgium from 1919 via indirect rule. This is likely so frequent as the French government actively supported the Hutu-led government of Habyarimana, prior to his assassination. In addition, the French launched Amaryllis (Operation Turquoise) to evacuate French and Belgian expatriates whilst refusing to take any Tutsi Rwandans with them. This was the source of significant controversy at the time. Words such as ‘report,’ ‘relief,’ ‘food,’ ‘Clinton,’ and ‘govern’ refer directly to the strange paradox that, despite wordwide attention, foreign intervention did not occur until the Belgian-led White Helmet (United Nations) presence towards the end of the fighting. The Rwandan genocide is extemely important in the context of international relations as it was said to be the catalyst for the Responsibility to Protect (R2P) United Nations doctrine that opened the way for future foreign interventions in the 21st century. We will explore this in greater detail towards the end of this analysis. First, however, let’s perform a quantitative sentiment analysis to better understanding the general sentiment themes produced in the NYT concerning the Rwandan genocide during its occurrence.

# isolate headlines
headlines <- rg_nyt %>%
  filter(year(pub_date) < 2000) %>%
  select(headline_txt) 

# convert given character vectors between encodings.
headlines <- iconv(headlines)

headlines_sentiment <- get_nrc_sentiment(headlines) 

barplot(colSums(headlines_sentiment),
        las = 2,
        col = rainbow(50),
        ylab = 'Count',
        main = 'NYT Rwandan Genocide Reporting Sentiment Scores')

5.2 Sentiment Analysis

The two most frequent themes were ‘negative’ and ‘fear’ in NYT reporting of the genocide during its occurrence. This aptly characterises both the feeling of the time within Rwanda and the view of the Western world when reading and writing about the genocide. Joy and surprise were the least two populous sentiments. The fact that surprise is so low is interesting as there has been a lengthy debate concerning the potential pre-meditation of such widespread ethnic violence in Rwanda at the time. Prosecutors at the International Criminal Tribunal for Rwanda (ICTR) argued that the perpetrators of the genocide had planned the violence prior to the assassination of President Habariyama - but were unable to prove so. This was commented on by an expert witness at the ICTR prosecution in 2010, which can be seen below:

html_elements(rwandan_genocide_html, css = ".templatequote > p:nth-child(1)") %>%
  html_text()  %>%
  str_replace_all("\\[122]", "") %>%
  str_replace_all('\\"conspiracy\\"',"'conspiracy'")
## [1] "What the Office of the Prosecutor has consistently failed to demonstrate is the alleged existence of a 'conspiracy' among the accused—presuming an association or a preexisting plan to commit genocide. This is the central argument at the core of its prosecution strategy, borrowing from the contentions initially put forth by academics and human rights defenders. With the exception of two judgements, confirmed on appeal, the Trial Chambers have uniformly found the prosecution's proof of a conspiracy wanting, regardless of the case."

Overall, however, the general sentiment themes occur as expected. Next, let’s look at the frequency of reporting in the NYT regarding Rwanda during the genocidal period and see if frequencies speak to specific events.

rg_nyt$index <- 1:nrow(rg_nyt)
# significant event dates in Rwandan Genocide
event_dates <- as.Date(c("1994-04-06","1994-04-09","1994-06-23", "1994-07-04","1994-07-18"))
rg_nyt %>%
  filter(year(pub_date) < 1995) %>%
  ggplot(aes(x = pub_date, family = "serif")) +
  geom_bar(stat = "count", fill = "blue") + 
  scale_y_continuous(breaks = seq(0, 10, by = 2)) +
  geom_vline(xintercept = as.numeric(event_dates),
             col = "red",
             lwd = 0.75) +
  geom_text(aes(x = event_dates[1], y = 6, label="Assassination of President Habyarimana"), size = 2.5, vjust = -0.4, angle = 90, col = 
  "black") +
  geom_text(aes(x = event_dates[2], y = 8, label="Interim Govt Sworn in"), size = 2.5, vjust = 1.2, angle = 90, col = 
  "black") +
  geom_text(aes(x = event_dates[3], y = 8, label="Opération Turquoise Commences"), size = 2.5, vjust = -0.4, angle = 90, col = 
  "black") +
  geom_text(aes(x = event_dates[4], y = 8, label="RPF take Kigali"), size = 2.5, vjust = -0.4, angle = 90, col = 
  "black") +
  geom_text(aes(x = event_dates[5], y = 8, label="RPF take Rwanda"), size = 2.5, vjust = -0.4, angle = 90, col = 
  "black") +
  labs(title = "Number of NYT Articles on Rwanda During the Genocide Period",
       subtitle = "The majority of articles concerning the Rwandan Genocide occurred at the beginning of, or after, the conflict.",
       y = 'Number of Articles', 
       x = 'Date') +
  theme_few() +
  theme(plot.title = element_text(size = 12, family = "serif"),
        plot.subtitle = element_text(size = 9, family = "serif"),
        axis.title = element_text(size = 10, family = "serif"))

5.3 Events during the Genocide

As seen above, and perhaps most interestingly, is that the majority of articles concerning the Rwandan genocide in the NYT during its occurrence were published at the beginning and end of the conflict. In fact, the NYT’s coverage of this protracted ethnic cleansing event was sporadic during the time where the majority of the killings occurred - between April 9th and the beginning of July. Information regarding the genocide was hard to come by during the bulk of the killings. This is likely because the Rwandan radio channels were actively involved in the mobilisation of the Hutu ethnic majority towards the killing of Tutsis. In fact, the major Hutu-Rwandan radio station, named ‘Radio Télévision Libre des Mille Collines,’ was given the nickname ‘Radio Machete’ due to their active role in the incitement of the genocide. Thus, trustworthy news sources were few in number during most of the crisis - and accessing information was potentially fatal for both Rwandans and foreign individuals. News resources such as televisions and newspapers were not popular due to a lack of resources, which was only exacerbated by high rates of local illiteracy. Only after the re-establishment of national governance in Rwanda and the presence of foreign officials did trustworthy information regarding the scope, motivations and methods of the genocide begin to percolate to western news media outlets. Despite the frequency of reporting on the Rwandan genocide being fewest during the times of the most fighting, this aligns with what we understand about the information mechanisms in Rwanda during the time and the control over said information by the Hutu ethnic majority until the RFP’s occupation of Kigali in early July. The importance of trustworthy information is reflected in the fact that the Human Rights Watch’s seminal report on the genocide, titled “Leave None to Tell the Story,” was published in March of 1999 - almost 4 years after the conflict.

5.4 Tools of War: Sexual Violence as Weapons of Genocide

The Rwandan genocide is well-known for the use of machetes, blunt-force objects and other intermediary technologies in contrast to more advanced firepower-based weaponry. This is likely a product of relative accessibility, as Dr. Verwimp’s quantitative analysis of weaponry usage during this period suggests: https://journals.sagepub.com/doi/pdf/10.1177/0022343306059576. This choice of weapon contributed to the brutality of this ethnic cleansing attempt. However, perpetrators of the genocide also utilised another form of violence to a large extent - sexual violence. A report by Binaifer Nowrojee titled ‘Shattered Lives: Sexual Violence During the Rwandan Genocide and Its Aftermath,’ published in 1996, is one of the most highly cited human rights reports to date.

However, was this strategy of sexual violence as a means of coercion known to those reporting on the genocide in the NYT at the time?

abstract_1994 <- rg_nyt_1994$abstract

# sum of abstracts mentioning 'sex' or 'rape' divided by number of articles in 1994.
(abstract_1994 %>%
  str_detect("sex") %>%
  sum() + abstract_1994 %>%
  str_detect("rape") %>%
  sum())/ length(abstract_1994)
## [1] 0.03056769

Only 7 of 229 (3.06%) articles during the genocidal time period contained the word ‘sex’ or ‘rape’ within the abstract. You would think that if sexual violence was a key theme of the article, then either of those two words would be included in the abstract. This constitutes a significant information gap compared to what we know now. The NYT articles, then, are likely not a useful information source for exploring the use of sexual violence durng the genocide. I will turn to Wikipedia instead.

sex_viol_url <- "https://en.wikipedia.org/wiki/Rape_during_the_Rwandan_genocide"
sex_viol_html <- read_html(sex_viol_url)
sex_viol_text_content <- sex_viol_html %>% html_elements(css = "p") %>% html_text() %>% paste(collapse = " ")

html_elements(sex_viol_html, css = ".mw-parser-output > blockquote:nth-child(14) > p:nth-child(1)") %>%
  html_text()  
## [1] "For 60 days, my body was used as a thoroughfare for all the hoodlums, militia men and soldiers in the district ... Those men completely destroyed me; they caused me so much pain. They raped me in front of my six children ... Three years ago, I discovered I had HIV/AIDS. There is no doubt in my mind that I was infected during these rapes."

The above quote illustrates the extent and brutality to which sexual violence was deployed and the long-lasting effects of such a strategy. Research by Walsh (2012) also suggests that almost every single female survivor over the age of 12 years old had been a victim of rape. This begs an interesting question - why was this theme of sexual violence not reported on by western media outlets to a greater extent during the genocide? Is this a product of a media bias against reporting on sexual violence, or a result of missing information for NYT journalists in 1994? This might be an interesting avenue for further research if one was able to construct plausible causal mechanisms to add to the body of descriptive inferential research available, such as in the following work:

https://unu.edu/publications/articles/rape-and-hiv-as-weapons-of-war.html, https://searchworks.stanford.edu/view/13037847, https://www.grin.com/document/199827, Walsh, Annelotte (2012). “The Girl Child”. In Lisa Yarwood (ed.). Women and Transitional Justice: The Experience of Women as Participants. Routledge.

# Extract survivors table from Wikipedia
Survivors <- rwandan_genocide_tables[[3]] 

names(Survivors)[names(Survivors) == 'Number of survivors'] <- 'no_of_survivors'

Survivors$no_of_survivors <- Survivors$no_of_survivors %>%
  str_replace_all(",", "")

Survivors %>%
  mutate(no_of_survivors = as.numeric(no_of_survivors)) %>%
  arrange(no_of_survivors) %>%
  ggplot(aes(x = reorder(Category,no_of_survivors), y = no_of_survivors, fill = Category, family = "serif")) +
  geom_bar(stat = 'identity') +
  labs(title = "Types and Numbers of Survivors of the Rwandan Genocide",
       subtitle = "Types of survivors are varied and vast in number, evidencing that the conflict affected all facets of society in Rwanda",
       y = 'Number of Survivors', 
       x = 'Type of Survivor') +
  theme_tufte() +
  theme(plot.title = element_text(size = 12),
        plot.subtitle = element_text(size = 9),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        legend.title = element_blank())

5.5 Aftermath: Survivors, Two Congo Wars & Responsibility to Protect (R2P)

As shown in the graphic above, the the Rwandan genocide had a varied impact the lives and prospects of survivors. High school graduates were the fewest in number, and those grouped as ‘very vulnerable survivors’ were by far the most populous group, with over double the number of the second most populous group - Widows. In short, the damage was comprehensive and affected all facets of public and private life - from the family to the socio-economic and socio-cultural.

In addition to affecting those in Rwanda, the genocide was said to be a major catalyst for the two Congo Wars that followed after approximately two million Hutu fled to refugee camps in neighbouring countries, particularly Zaire (now DRC), in fear of retribution from the RFP following their victory in July.

Finally, one of the more significant international outcomes of the Rwandan genocide was the formulation of the Responsibility to Protect (R2P) doctrine. This was born from the widespread horror of nations who watched as both the Rwandan genocide and Srebrenica genocide occurred with little foreign intervention until too late.

# Extract Responsibility to Protect info from Wikipedia
R2P_url <- "https://en.wikipedia.org/wiki/Responsibility_to_protect"
R2P_html <- read_html(R2P_url)
R2P_text_content <- R2P_html %>% html_elements(css = "p") %>% html_text() %>% paste(collapse = " ")

html_elements(R2P_html, css = ".mw-parser-output > p:nth-child(31)") %>%
  html_text() %>%
  str_replace_all("\\[38\\]","") %>%
  str_replace_all("\\[39\\]","") %>%
  str_replace_all("\\[40\\]","") %>%
  str_replace_all("\\[41\\]","") %>%
  str_replace_all("\\n","") %>%
  substring(1,356)
## [1] "The norm of the R2P was born out of the international community's failure to respond to tragedies such as the Rwandan genocide in 1994 and the Srebrenica genocide in 1995. Kofi Annan, who was Assistant Secretary-General at the UN Department for Peacekeeping Operations during the Rwandan genocide, realized the international community's failure to respond."

R2P is a doctrine committed to by all member states of the United Nations that seeks to address four key human rights concerns: genocide, war crimes, ethnic cleansing and crimes against humanity. This has become a well-established international norm in the last two decades. The R2P doctrine has three pillars, seen below:

r2p_pillars <- html_elements(R2P_html, css = ".mw-parser-output > ol:nth-child(4)") %>%
  html_text() %>%
  str_replace_all("\\n","") %>%
  str_replace_all('\\"','') %>%
  str_replace_all('"','') %>%
  str_replace_all('\\[9\\]','') %>%
  str_replace_all('\\[10\\]','') %>%
  str_split(pattern = "Pillar")

r2p_pillars[[1]][2:4]
## [1] " I: The protection responsibilities of the state – Each individual state has the responsibility to protect its population from genocide, war crimes, ethnic cleansing, and crimes against humanity"
## [2] " II: International assistance and capacity-building – States pledge to assist each other in their protection responsibilities"                                                                     
## [3] " III: Timely and decisive collective response – If any state is manifestly failing in its protection responsibilities, then states should take collective action to protect the population."

Although the doctrine was unanimously supporting at the 2005 World Summit, R2P has been the subject of significant debate. Sovereign states have accused nations such as the United States and United Kingdom of using the doctrine as a cover for the illegal occupation of their nations. An interesting avenue of further analysis might be a time-series sentiment analysis of reporting about the R2P doctrine that seeks to study the impact of events such as the invasion of Afghanistan, Iraq, and more recently Ukraine on how the media discusses R2P.

5.6 Conclusions

The Rwandan genocide was brutal, violent, effective - and had broad consequences for the future of our understanding on state sovereignty, ethnicity and the relationship between state and society. The NYT, as a proxy for the mainstream media in the western world, reported before, during and after on the genocidal event. They captured the sentiment surrounding the events, as evidenced by my brief analysis. However, there were some consequential gaps in their reporting - namely in the covering of sexual violence, which mean that they failed to capture the whole picture of the atrocities during the time. The availability of information affected coverage, as it was dangerous for reliable information to be extracted by Rwandans or foreign individuals. The spike in coverage after the termination of the genocide speaks to this lack of reliable information. The consequences of this event were domestic, regional and transnational. At a domestic level, the destruction of civil society in Rwanda was, and will be, felt for generations. At the regional level, the spillover of violence into neighbouring countries acted as a catalyst for future conflict in Africa, namely the two Congan wars. Transnationally, the brutality and comprehensiveness of the Rwandan genocide woke the international community up to the notion that they might have a responsibility for the protection of human rights in foreign, sovereign, states as well as within their own borders. Many have argued that this responsibility to protect has had both positive and negative effects on the scope of conflict in the following two decades - but there is no debate around the Rwandan genocide’s contribution to the formulation of such a transnational policy.

Exercise 6 (9 points)

Create an SQLite database called fb-posts.sqlite and connect to it with the DBI package. Store the file posts.csv (without any editing in R) in the database as its only table posts. The table may only contain the original information from posts.csv, all computations in this exercise have to be done with SQL.

With only a single (which is the main challenge in this exercise) SQL query through DBI, replicate the output from Exercise 1 (i.e. also normalise by the max minus min based only on even months etc.). This means that the query should return the same 10 lowest normalised_clr column values with the associated screen names as in Exercise 1.

db <- dbConnect(RSQLite::SQLite(), "/Users/christycoulson/Documents/Projects/ASDS/GitHub/MSc-ASDS-22/MY472/Assignments/final-assignment-christycoulson/data/fb-posts.sqlite")

# found col_types at https://readr.tidyverse.org/reference/cols.html
posts <- read_csv("data/posts.csv", col_types = cols(date = col_character()))

dbWriteTable(db, "posts", posts, overwrite = TRUE)

dbListFields(db, "posts")
##  [1] "screen_name"    "date"           "post_type"      "message"       
##  [5] "likes_count"    "comments_count" "shares_count"   "love_count"    
##  [9] "haha_count"     "wow_count"      "angry_count"    "sad_count"     
## [13] "gender"         "type"           "party"
# Main_1 contains all of posts plus clr and Month
# Main_2 groups by screen_name and calculates normalised_based_on_even_months
# Main_3 joins tables and creates normalised_clr, arranges and limits to 10.
dbGetQuery(db,
           "with main_1 AS (SELECT *, comments_count/likes_count AS clr, CAST(SUBSTRING(date, 6, 2) AS NUMERICAL) AS Month
            FROM posts
            WHERE likes_count > 0.0
            ), 
            main_2 AS (SELECT screen_name, MAX(clr) - MIN(clr) AS normaliser_based_on_even_months
            FROM main_1
            WHERE Month % 2.0 = 0
            GROUP BY screen_name
            HAVING normaliser_based_on_even_months > 0.0
            ORDER BY normaliser_based_on_even_months DESC
            ), main_3 AS (SELECT main_1.screen_name, main_1.clr/main_2.normaliser_based_on_even_months AS normalised_clr
            FROM main_1 LEFT JOIN main_2
            ON main_1.screen_name = main_2.screen_name
            WHERE main_1.clr/main_2.normaliser_based_on_even_months > 0.0)
            SELECT * from main_3
            ORDER BY normalised_clr ASC
            LIMIT 10
            ")
##                      screen_name normalised_clr
## 1  CongresswomanSheilaJacksonLee    0.001395056
## 2  CongresswomanSheilaJacksonLee    0.002142407
## 3  CongresswomanSheilaJacksonLee    0.002181360
## 4                   SenDuckworth    0.002318900
## 5  CongresswomanSheilaJacksonLee    0.002768649
## 6  CongresswomanSheilaJacksonLee    0.002856543
## 7  CongresswomanSheilaJacksonLee    0.002950200
## 8                   SenDuckworth    0.003425708
## 9                      RepMullin    0.003427005
## 10 CongresswomanSheilaJacksonLee    0.003570678

Exercise 7 (25 points)

The goal of this last exercise is to develop an own project, now also based on data of your choice rather than a pre-selected data source. It thereby involves to read into new APIs and to build a relational database on the way before analysing the collected data. To help with the task, the exercise breaks it down into steps. You will first obtain data through APIs, organise and store it in a relational database, and then examine and present the topic through computations and visualisations.

  1. Read through this list of APIs mentioned in the lecture and choose any of the API(s) in which you are particularly interested. Make sure that the API(s) you choose contain data for a coherent analysis later on. Also read the documentation of the API(s).

  2. Obtain the relevant data from your chosen API(s) through R with httr (in case you cannot get the code to run for your chosen API with httr, you can also use pre-built packages to proceed with the remaining parts). Then process the data in R e.g. with the typical tidyverse functionalities.

  3. Create a SQLite database, and store at least two tables into a well structured and thought-out relational database (no need to store any data in the database that you do not need in your later analysis). Run SQL queries which return the first five rows and all columns of your tables (to give the reader a preview of what you have collected). Also run SQL queries to return the total amount of rows of your tables.

  4. Demonstrate which of the tables can be joined. Return only the first five rows of the joined tables, and also return the total number of rows in the joined tables with a query.

  5. Now query the database with SQL to obtain data for the main analysis of this exercise. Afterwards you can do all subsequent steps with R. Analyse and illustrate your data e.g. numerically with packages such as dplyr, through visualisations based on ggplot2 or plotly, quantitative text analysis with quanteda, or any other packages that are helpful. Also motivate and describe your analysis through markdown texts.

Exercise 7: An analysis of Crime and Police Practices in London

I have chosen to analyse crime and stop and search data in central London for the month of October 2022. I chose this date because it contains the most recent full dataset for a single month. I chose central London because of the relative inequality and diversity in london (https://www.trustforlondon.org.uk/data/income-inequality-over-time/#:~:text=In%202019%2F20%20those%20in,over%20the%20last%2020%20years.) and the theory of association between wealth inequality and crime (https://link.springer.com/article/10.1007/s00148-015-0579-3 & https://equalitytrust.org.uk/crime#:~:text=Economic%20inequality%20affects%20violence%20by,incites%20fear%2C%20violence%20and%20murder.). Given the massive population of London, I have chosen to include data that is not geo-located but is processed by the Metropolitan Police Department, and those crimes that occur within a 1 mile radius of Leicester Square, a centrepiece of London. However, my tables will be structured and named in a scaleable way which means that future analysts can make changes to API calls and to focus on different areas and times.

I will use the U.K. Police API to process data on crime, stop and search and outcomes of crimes. They will be stored in tables as such:

  1. Crime
  2. Crime Outcomes
  3. Crime Locations
  4. Stop and Search by Police Force
  5. Stop and Search by locale (subset of no.4 for more granular analysis)

This will facilitate scaleable analysis for future researchers.

# Crime handled by metropolitan police force within 1 mile radius of Leicester Square in October 2022
# Called crime_loc so that user can change api details to focus on other lat/lon values.
crime_loc <- GET("https://data.police.uk/api/crimes-street/all-crime?lat=51.510046&lng=-0.127685&date=2022-10") %>%
    content("text") %>%
  fromJSON()

Sys.sleep(30)

# Crime handled by metropolitan police force with no location tag in October 2022.
# Called crime_force so that user can change api details to focus on other force and dates.
crime_force <- GET("https://data.police.uk/api/crimes-no-location?category=all-crime&force=metropolitan&date=2022-10") %>%
    content("text") %>%
  fromJSON()

Sys.sleep(30)

# Crime outcomes for Crimes handled by metropolitan police force within 1 mile radius of Leicester Square.
# Outcomes are given for those crimes that were ALLOCATED outcomes in October and November of 2022.
# Called crime_outcomes_loc so that user can change api details to focus on other lat/lon values.
crime_outcomes_loc_10 <- GET("https://data.police.uk/api/outcomes-at-location?date=2022-10&lat=51.510046&lng=-0.127685") %>%
  content("text") %>%
  fromJSON() %>%
  unnest(category) %>%
  unnest(crime) %>%
  unnest(location) %>%
  unnest(street, names_repair = "unique")

Sys.sleep(30)

crime_outcomes_loc_11 <- GET("https://data.police.uk/api/outcomes-at-location?date=2022-11&lat=51.510046&lng=-0.127685") %>%
  content("text") %>%
  fromJSON() %>%
  unnest(category) %>%
  unnest(crime) %>%
  unnest(location) %>%
  unnest(street, names_repair = "unique")

crime_outcomes_loc <- rbind(crime_outcomes_loc_10, crime_outcomes_loc_11)

Sys.sleep(30)

# stop and search handled by metropolitan police force within 1 mile radius of Leicester Square in October 2022
# Called sas_loc so that user can change api details to focus on other lat/lon values.
sas_loc <- GET("https://data.police.uk/api/stops-street?lat=51.510046&lng=-0.127685&date=2022-10") %>%
  content("text") %>%
  fromJSON()

Sys.sleep(30)

# stop and search handled by metropolitan police
# Called sas_force so that user can change api details to focus on stop and searches conducted by other police forces at different dates.
sas_force <- GET("https://data.police.uk/api/stops-force?force=metropolitan&date=2022-10") %>%
  content("text") %>%
  fromJSON()

Sys.sleep(30)

# geolocational data that extracts neighbourhoods and their id's that are administered by a specific police force.
neighbourhoods_force <- GET("https://data.police.uk/api/metropolitan/neighbourhoods") %>%
    content("text") %>% 
    fromJSON(flatten = FALSE)

7.2: Data wrangling for storage.

For part 2, I want to categorise different groups of information into separate tables whilst allowing for these tables to be connected for a full analysis. This means we can be more selective with the information that we analyse, reducing computational costs when only a subsection of the columns are required.

I will create the following tables, with the following columns:

crime: (Contains crime data that combines non-locational data by Force (Metropolitan in this case) and data on crimes that are within a 1 mile radius of a specified location (Leicester square, in the case of our data.))

Fields: Crime_id (unique ID of crime attributed by API, can be given before outcome is decided), persistent_id (unique ID of crime, once attributed after outcome, unique to Police), Crime_type (type of crime), month (month crime was committed).

crime_location_info: Contains locational data for crimes.

Fields: Crime_id, persistent_id, location_type (how was this location determined in data), location_subtype, latitude, longitude, street_id, street_name

crime_outcomes_loc: Outcomes of crimes within a mile radius of a specific longitude/latitude and date combination.

Fields: Crime_id, persistent_id, Outcome of crime, more information on outcome of crime, date of outcome, person_id associated with crime, type of location where crime was committed (not available in location table).

sas_force: Stop and Search data from specific police force in a given month.

Fields: outcome of Sas, did this SaS involve a person, ethnicity defined by suspect, gender of suspect, legislation used to SaS, was outcome linked to why SaS was carried out, date & time, was the person searched to the point that their outer clothing was removed, latitude, longitude, street_id, street_name, was SaS a part of an operation, name of operation (if applicable), ethnicity of suspect as judged by police, type of SaS, object being searched for, index (unique in our database)

sas_loc: Stop and search data from specific police force in a given month

Fields: outcome of Sas, did this SaS involve a person, ethnicity defined by suspect, gender of suspect, legislation used to SaS, was outcome linked to why SaS was carried out, date & time, was the person searched to the point that their outer clothing was removed, latitude, longitude, street_id, street_name, was SaS a part of an operation, name of operation (if applicable), ethnicity of suspect as judged by police, type of SaS, object being searched for

neighbourhoods_force: Includes neighbourhood and neighbourhood IDs that are administered by specified police force.

Fields: Neighbourhood, Neighbourhood ID

Columns that might be useful but are currently unpopulated have been kept as there may be incoming information in those fields that is useful, dependant on the API specifications. The locational and outcome information was retained in the same table for Stop and Searches due to the Police not generating a unique ID for each occurrence. Thus, it would be dangerous and add risk in the form of human or parsing errors if I was to engineer a unique ID when it is not necessary for this analysis.

# First, let's create the crime database. I want to remove the location information and store that in a separate table so that the data can be sliced by location by analysts. 
crime_force <- crime_force %>%
  select(category, persistent_id, id, month) 

# Separate Location Data, contains id of crime, persistent_id of crime, latitude, longitude, street id, street name,  location_type and location_subtype. This is joined to crime table by id = id 
crime_location_info <- crime_loc %>%
  unnest(location) %>%
  unnest(street, names_repair = "unique") %>%
  select(location_type, location_subtype, id...4, name, persistent_id, latitude, longitude, id...10) %>%
  rename(street_id = id...4, Crime_id = id...10, street_name = name)

# Remove location data from crime_loc
crime_loc <- crime_loc %>%
  select(category, persistent_id, id, month)

# Crime table 
# just contains crime_id, persistent_id (constant id), type of crime and the month that the crime occurred in. 
crime <- rbind(crime_force, crime_loc) %>%
  rename(Crime_type = category, crime_id = id) %>%
  arrange(desc(crime_id))

rm(crime_force, crime_loc)

# Crime outcomes 
# need to unnest() as we're dealing with nested data as output from API.
# Contains outcome of crime, more descriptive info, the id of person arrested, persistent_id of crime and id of crime. 
crime_outcomes_loc <- crime_outcomes_loc %>%
  select(code, name...2, date, person_id, persistent_id, id...13, location_subtype) %>%
  rename(outcome = code, outcome_info = name...2, crime_id = id...13, date_of_outcome = date)

rm(crime_outcomes_loc_10, crime_outcomes_loc_11)

# Stop and search by force name is already formatted correctly. 
# Stop and search by force does not have a unique ID so i will create one so that we can identify unique instances. 
# However, this index CANNOT interact with sas_loc. Thus, these two databases should not be used together because it is possible that occurrences in one of the databases may occur in the other database. 
sas_force$index <- 1:nrow(sas_force)

# remove duplicate outcome
sas_force <- sas_force %>% 
  unnest(location) %>%
  unnest(street, names_repair = "unique") %>%
  select(-outcome_object) %>%
  rename(street_id = id, street_name = name)

# Change values now for future data visualisation
sas_force$type[sas_force$type == "Person and Vehicle search"] <- "Pers & Veh"
sas_force$type[sas_force$type=="Person search"] <- "Person"
sas_force$type[sas_force$type=="Vehicle search"] <- "Vehicle"

sas_loc <- sas_loc %>% 
  unnest(location) %>%
  unnest(street, names_repair = "unique") %>%
  select(-outcome_object) %>%
  rename(street_id = id, street_name = name)

neighbourhoods_force <- neighbourhoods_force %>%
  rename(neighourhood_id = id, neighbourhood = name)
  1. 7.3: Writing tables to DB and illustrating content.
policing_db <- dbConnect(RSQLite::SQLite(), "/Users/christycoulson/Documents/Projects/ASDS/GitHub/MSc-ASDS-22/MY472/Assignments/final-assignment-christycoulson/data/policing_db.sqlite")

# Write crime table into policing_db 
dbWriteTable(policing_db, "crime", crime, overwrite = TRUE)

# Write crime_location_info into policing_db
dbWriteTable(policing_db, "crime_location_info", crime_location_info, overwrite = TRUE)

# Write crime_outcomes_loc into policing_db
dbWriteTable(policing_db, "crime_outcomes_loc", crime_outcomes_loc, overwrite = TRUE)

# Write sas_force into policing_db
dbWriteTable(policing_db, "sas_force", sas_force, overwrite = TRUE)

# Write sas_loc into policing_db
dbWriteTable(policing_db, "sas_loc", sas_loc, overwrite = TRUE)


# crime table
dbListFields(policing_db, "crime")
## [1] "Crime_type"    "persistent_id" "crime_id"      "month"
dbGetQuery(policing_db,
           "SELECT * 
           FROM crime
           LIMIT 5
            ")
##              Crime_type persistent_id  crime_id   month
## 1 theft-from-the-person               106776267 2022-10
## 2 theft-from-the-person               106776266 2022-10
## 3 theft-from-the-person               106776265 2022-10
## 4 theft-from-the-person               106776264 2022-10
## 5 theft-from-the-person               106776263 2022-10
dbGetQuery(policing_db,
           "SELECT COUNT(*) as NumberOfRows
           FROM crime
            ")
##   NumberOfRows
## 1         9584
# crime_location_info table
dbListFields(policing_db, "crime_location_info")
## [1] "location_type"    "location_subtype" "street_id"        "street_name"     
## [5] "persistent_id"    "latitude"         "longitude"        "Crime_id"
dbGetQuery(policing_db,
           "SELECT * 
           FROM crime_location_info
           LIMIT 5
            ")
##   location_type location_subtype street_id                street_name
## 1         Force                    1678704   On or near Bridge Street
## 2         Force                    1678704   On or near Bridge Street
## 3         Force                    1677604                On or near 
## 4         Force                    1677447 On or near Strutton Ground
## 5         Force                    1677378   On or near Caxton Street
##   persistent_id  latitude longitude  Crime_id
## 1               51.500978 -0.125524 105785927
## 2               51.500978 -0.125524 105785928
## 3               51.516623 -0.132976 105804447
## 4               51.497497 -0.133922 105804448
## 5               51.498553 -0.135291 105804459
dbGetQuery(policing_db,
           "SELECT COUNT(*) as NumberOfRows
           FROM crime_location_info
            ")
##   NumberOfRows
## 1         7401
# crime_outcomes_loc table
dbListFields(policing_db, "crime_outcomes_loc")
## [1] "outcome"          "outcome_info"     "date_of_outcome"  "person_id"       
## [5] "persistent_id"    "crime_id"         "location_subtype"
dbGetQuery(policing_db,
           "SELECT * 
           FROM crime_outcomes_loc
           LIMIT 5
            ")
##             outcome                                  outcome_info
## 1 no-further-action Investigation complete; no suspect identified
## 2 no-further-action Investigation complete; no suspect identified
## 3 no-further-action Investigation complete; no suspect identified
## 4 no-further-action Investigation complete; no suspect identified
## 5 no-further-action Investigation complete; no suspect identified
##   date_of_outcome person_id
## 1         2022-10        NA
## 2         2022-10        NA
## 3         2022-10        NA
## 4         2022-10        NA
## 5         2022-10        NA
##                                                      persistent_id  crime_id
## 1 caac775ddb8b83abdb7abb45acf2d88c447821a22909a8d2439f4d6edf22d60c 104381370
## 2 ba6f9d545c9e30ea8ae6d6da39c5d5eee267a8c2cfce402ac4e569a3955f8dc3 105885556
## 3 46a1964f2fa77f2a13849df008fe42351ec715f78dcb2fbb914d962c5c7bd670 105879693
## 4 c20c8cd3d1ee94a6c13efd70d7d3a17dade0999a466527a579fc0ccb33e22df8 104831252
## 5 ee1cf255d3f9542186c3768cbffd022932fc74aca7cb8fb5a157cfa4fdbba253 101632407
##   location_subtype
## 1             ROAD
## 2             ROAD
## 3             ROAD
## 4             ROAD
## 5             ROAD
dbGetQuery(policing_db,
           "SELECT COUNT(*) as NumberOfRows
           FROM crime_outcomes_loc
            ")
##   NumberOfRows
## 1         9397
# sas_force table
dbListFields(policing_db, "sas_force")
##  [1] "age_range"                           "outcome"                            
##  [3] "involved_person"                     "self_defined_ethnicity"             
##  [5] "gender"                              "legislation"                        
##  [7] "outcome_linked_to_object_of_search"  "datetime"                           
##  [9] "removal_of_more_than_outer_clothing" "latitude"                           
## [11] "street_id"                           "street_name"                        
## [13] "longitude"                           "operation"                          
## [15] "officer_defined_ethnicity"           "type"                               
## [17] "operation_name"                      "object_of_search"                   
## [19] "index"
dbGetQuery(policing_db,
           "SELECT * 
           FROM sas_force
           LIMIT 5
            ")
##   age_range                      outcome involved_person
## 1     25-34                       Arrest               1
## 2     18-24 A no further action disposal               1
## 3   over 34 A no further action disposal               1
## 4   over 34 A no further action disposal               1
## 5     25-34         Community resolution               1
##                                  self_defined_ethnicity gender
## 1                          Asian/Asian British - Indian   Male
## 2                       Asian/Asian British - Pakistani   Male
## 3       Black/African/Caribbean/Black British - African   Male
## 4                       Other ethnic group - Not stated   Male
## 5 White - English/Welsh/Scottish/Northern Irish/British   Male
##                                         legislation
## 1                    Firearms Act 1968 (section 47)
## 2                    Firearms Act 1968 (section 47)
## 3 Police and Criminal Evidence Act 1984 (section 1)
## 4 Police and Criminal Evidence Act 1984 (section 1)
## 5             Misuse of Drugs Act 1971 (section 23)
##   outcome_linked_to_object_of_search                  datetime
## 1                                 NA 2022-10-01T20:25:00+00:00
## 2                                 NA 2022-10-02T20:30:00+00:00
## 3                                 NA 2022-10-04T16:30:00+00:00
## 4                                 NA 2022-10-07T21:30:00+00:00
## 5                                 NA 2022-10-09T18:15:00+00:00
##   removal_of_more_than_outer_clothing  latitude street_id
## 1                                  NA 51.471859   1650892
## 2                                  NA 51.439107   1651983
## 3                                  NA 51.471859   1650892
## 4                                  NA 51.480591   1650423
## 5                                  NA 51.484497   1651265
##                    street_name longitude operation officer_defined_ethnicity
## 1    On or near Camberley Road -0.455397         0                     Asian
## 2 On or near St Dunstan'S Road -0.420848         0                     Asian
## 3    On or near Camberley Road -0.455397         0                     Asian
## 4                  On or near  -0.474025         0                     Asian
## 5  On or near Harlington Close -0.441418         0                     White
##         type operation_name                   object_of_search index
## 1     Person             NA                           Firearms     1
## 2     Person             NA                           Firearms     2
## 3     Person             NA                  Offensive weapons     3
## 4     Person             NA Evidence of offences under the Act     4
## 5 Pers & Veh             NA                   Controlled drugs     5
dbGetQuery(policing_db,
           "SELECT COUNT(*) as NumberOfRows
           FROM sas_force
            ")
##   NumberOfRows
## 1        14359
# sas_loc table
dbListFields(policing_db, "sas_loc")
##  [1] "age_range"                           "outcome"                            
##  [3] "involved_person"                     "self_defined_ethnicity"             
##  [5] "gender"                              "legislation"                        
##  [7] "outcome_linked_to_object_of_search"  "datetime"                           
##  [9] "removal_of_more_than_outer_clothing" "latitude"                           
## [11] "street_id"                           "street_name"                        
## [13] "longitude"                           "operation"                          
## [15] "officer_defined_ethnicity"           "type"                               
## [17] "operation_name"                      "object_of_search"
dbGetQuery(policing_db,
           "SELECT * 
           FROM sas_loc
           LIMIT 5
            ")
##   age_range                      outcome involved_person
## 1     18-24 A no further action disposal               1
## 2     25-34 A no further action disposal               1
## 3      <NA> A no further action disposal               1
## 4     18-24 A no further action disposal               1
## 5   over 34 A no further action disposal               1
##                            self_defined_ethnicity gender
## 1                 Other ethnic group - Not stated   Male
## 2 Black/African/Caribbean/Black British - African   Male
## 3                 Other ethnic group - Not stated   Male
## 4                 Other ethnic group - Not stated   Male
## 5              White - Any other White background   Male
##                                         legislation
## 1             Misuse of Drugs Act 1971 (section 23)
## 2 Police and Criminal Evidence Act 1984 (section 1)
## 3 Police and Criminal Evidence Act 1984 (section 1)
## 4 Police and Criminal Evidence Act 1984 (section 1)
## 5             Misuse of Drugs Act 1971 (section 23)
##   outcome_linked_to_object_of_search                  datetime
## 1                                 NA 2022-10-01T00:55:00+00:00
## 2                                 NA 2022-10-01T03:55:00+00:00
## 3                                 NA 2022-10-01T02:32:00+00:00
## 4                                 NA 2022-10-01T00:50:00+00:00
## 5                                 NA 2022-10-01T05:34:00+00:00
##   removal_of_more_than_outer_clothing  latitude street_id
## 1                                  NA 51.513804   1678045
## 2                                  NA 51.510373   1679248
## 3                                  NA 51.513970   1677442
## 4                                  NA 51.513804   1678045
## 5                                  NA 51.511339   1678405
##                   street_name longitude operation officer_defined_ethnicity
## 1     On or near Greek Street -0.130733         0                     Other
## 2      On or near Savoy Court -0.120887         0                     Black
## 3   On or near Wardour Street -0.134387         0                     Black
## 4     On or near Greek Street -0.130733         0                     Black
## 5 On or near Cranbourn Street -0.128427         0                     White
##            type operation_name object_of_search
## 1 Person search             NA Controlled drugs
## 2 Person search             NA     Stolen goods
## 3 Person search             NA     Stolen goods
## 4 Person search             NA     Stolen goods
## 5 Person search             NA Controlled drugs
dbGetQuery(policing_db,
           "SELECT COUNT(*) as NumberOfRows
           FROM sas_loc
            ")
##   NumberOfRows
## 1         1105
  1. 7.4: Joining of Tables

The primary key for the crime table is crime.crime_id. It exists as a foreign key in crime_location_info and crime_outcomes_loc tables. Joins are made by joining the crime table to both of those tables. A left join will maintain all the crime occurrences in the crime table.

An inner join relative to either of the other two tables will limit the rows to those with locational data, and those with outcome data.

It is imperative that the sas_force and sas_loc tables are NOT joined, as there is no police-provided unique ID and thus they need to be analysed separately in fear of duplicating data in analyses.

# Crime tables completely joined 

dbGetQuery(policing_db,
           "SELECT *
            FROM crime 
            LEFT JOIN crime_location_info loc
            ON crime.crime_id = loc.Crime_id
            LEFT JOIN crime_outcomes_loc outc
            ON crime.crime_id = outc.crime_id
            LIMIT 5
            ")
##              Crime_type persistent_id  crime_id   month location_type
## 1 theft-from-the-person               106776267 2022-10           BTP
## 2 theft-from-the-person               106776266 2022-10           BTP
## 3 theft-from-the-person               106776265 2022-10           BTP
## 4 theft-from-the-person               106776264 2022-10           BTP
## 5 theft-from-the-person               106776263 2022-10           BTP
##   location_subtype street_id             street_name persistent_id  latitude
## 1       LU STATION   2343757 Green Park (lu Station)               51.506954
## 2       LU STATION   2343757 Green Park (lu Station)               51.506954
## 3       LU STATION   2343757 Green Park (lu Station)               51.506954
## 4       LU STATION   2343757 Green Park (lu Station)               51.506954
## 5       LU STATION   2343757 Green Park (lu Station)               51.506954
##   longitude  Crime_id outcome outcome_info date_of_outcome person_id
## 1 -0.142080 106776267    <NA>         <NA>            <NA>        NA
## 2 -0.142080 106776266    <NA>         <NA>            <NA>        NA
## 3 -0.142080 106776265    <NA>         <NA>            <NA>        NA
## 4 -0.142080 106776264    <NA>         <NA>            <NA>        NA
## 5 -0.142080 106776263    <NA>         <NA>            <NA>        NA
##   persistent_id crime_id location_subtype
## 1          <NA>       NA             <NA>
## 2          <NA>       NA             <NA>
## 3          <NA>       NA             <NA>
## 4          <NA>       NA             <NA>
## 5          <NA>       NA             <NA>
dbGetQuery(policing_db,
           "SELECT COUNT(*) as NumberOfRows
            FROM crime 
            LEFT JOIN crime_location_info loc
            ON crime.crime_id = loc.Crime_id
            LEFT JOIN crime_outcomes_loc outc
            ON crime.crime_id = outc.crime_id
            ")
##   NumberOfRows
## 1         9781

7.5: An Analysis of Crime and Policing in London

With this analysis, I hope to uncover patterns and points of interest in policing practices and crime in London. I will apply social theory to my empirics. However, due to only taking a sample, defined by time (October 2022) and geospatiality (1 mile radius from Leicester square and incidents without geolocation information that were administered to by met police force), there may be sampling error. Unfortunately, I don’t have access to information regarding the total numbers, and if I was able to access said information - I could not be sure these instances were operationalised the same. Thus, I will treat my sample as if it were a population and there is an explicit sampling bias built into these findings. First, I will try to generate explanations regarding the whole of London. Then, I will triangulate on specific areas for a more acute analysis.

Let’s begin with total numbers during the month of October 2022.

crime <- dbGetQuery(policing_db,
           "SELECT *
            FROM crime 
            LEFT JOIN crime_location_info loc
            ON crime.crime_id = loc.Crime_id
            LEFT JOIN crime_outcomes_loc outc
            ON crime.crime_id = outc.crime_id
            ")

crime <- crime[-c(6,9,17,18)]

sas_met <- dbGetQuery(policing_db,
           "SELECT * 
           FROM sas_force
           ")
paste("There are", n_distinct(crime$crime_id), "crime occurrences in my database for October 2022.")
## [1] "There are 9584 crime occurrences in my database for October 2022."
paste("There are", n_distinct(sas_met), "Stop and Searches in my database for October 2022.")
## [1] "There are 14359 Stop and Searches in my database for October 2022."
crime %>% 
  group_by(Crime_type) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
## # A tibble: 14 × 2
##    Crime_type            count
##    <chr>                 <int>
##  1 other-theft            2585
##  2 theft-from-the-person  2148
##  3 anti-social-behaviour  1197
##  4 violent-crime          1158
##  5 vehicle-crime           425
##  6 drugs                   382
##  7 public-order            382
##  8 robbery                 377
##  9 shoplifting             363
## 10 burglary                290
## 11 criminal-damage-arson   252
## 12 bicycle-theft           161
## 13 possession-of-weapons    33
## 14 other-crime              28
# create blank theme for visualisations
blank_theme <- theme_minimal() +
  theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.border = element_blank(),
  panel.grid= element_blank(),
  axis.ticks = element_blank(),
  plot.title= element_text(size=12, family = "serif", face = "bold"),
  plot.subtitle = element_text(size = 9, family = "serif")
  )

crime %>% 
  group_by(Crime_type) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = reorder(Crime_type, (-count)), y = count, fill = Crime_type)) +
  geom_bar(stat = 'identity') +
  geom_label_repel(aes(label = Crime_type),
                   max.overlaps = 10,
                   box.padding = 0.2,
                   point.padding = 1,
                   segment.color = 'grey13',
                   show.legend = FALSE) +
  guides(fill = guide_legend(title = "Type of Crime",
                             override.aes = list(size = 2, shape = 10),
                             title.theme = element_text(face = "bold", size = 9))) +
  labs(title = "Number of crimes in October 2022 in London Sample per Crime Type",
       subtitle = "Theft remains the most common crime, with anti-social behaviour and violent crime also still common ") +
  xlab("Type of Crime") +
  ylab("Number of Incidences") +
  theme_clean() +
  theme(plot.title = element_text(size = 12, family = "serif", face = "bold"), 
        plot.subtitle = element_text(size = 10, family = "serif"),
        axis.text.x=element_blank(),
        axis.ticks.x = element_blank(),
        legend.justification = "top")

Types and Occurrences of Crime

Crime in London, October 2022, was varied in type. Theft remained the most common crime, accounting for 4805 of 9584 instances (50.1%). This was followed by anti-social behaviour and violent crime, with 1197 (12.5%) and 1152 (12.0%) incidents respectively. Together, these crimes accounted for almost 75% of all crime within a mile radius of Leicester Square or without geolocational data for October 2022 in London. Interestingly, drugs only accounts for 377 (3.9%) of all arrests. We will revisit this later when we compare this with the number of Stop and Searches concerning drug possession and use. Given that there were 9584 registered crimes in my sample, let’s see how these instances were resolved.

crime %>%
  group_by(outcome) %>%
  summarise(count = n_distinct(crime_id)) %>%
  arrange(desc(count))
## # A tibble: 7 × 2
##   outcome               count
##   <chr>                 <int>
## 1 <NA>                   7053
## 2 no-further-action      2262
## 3 local-resolution        146
## 4 awaiting-court-result    86
## 5 cautioned                23
## 6 penalty-notice-issued    13
## 7 unable-to-prosecute       4
crime %>%
  group_by(date_of_outcome) %>%
  summarise(count = n_distinct(crime_id)) %>%
  arrange(desc(count))
## # A tibble: 3 × 2
##   date_of_outcome count
##   <chr>           <int>
## 1 <NA>             7053
## 2 2022-10          1891
## 3 2022-11           650
crime %>% 
  filter(!is.na(outcome)) %>%
  group_by(outcome) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = "", y=count, fill = outcome)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start=0) +
  labs(title = "Outcomes of Crimes Committed in Oct 2022 that were Resolved in Oct/Nov 2022",
       subtitle = "The majority of crimes committed resolved in 'no further action'") +
  scale_fill_viridis_d(option = "plasma") +
  blank_theme

Outcomes of Crime

I chose to include both October and November (Most recent) outcome data in order to maximise our sample given that we’re looking at crimes committed in October 2022 and there is often a period of administrative and deliberative delay to outcomes in even mildly complex criminal cases. Only 1892 of 9584 (19.7%) crimes occurring in October 2022 were also resolved in October 2022. 650 (6.8%) of crimes committed in October 2022 received an outcome in November 2022.

As seen above, the large majority of those crimes that had an outcome ended in no further action. In fact, a huge 89% ended this way. The lopsidedness of these incidents per outcome begs a question on police practice - if 89% of cases are ending in no further action, are we spending too great a proportion of our police resources on menial arrests with little chargeability? This inital analysis of outcomes, although simple, has provided an early interesting avenue of further inquiry whereby researchers could examine the impact of police focus on differing types of crime on outcomes, and the subsequent impact that has on overall crime figures.

Next, let’s look at what we can discern from the locational data available about the patterns of crime and policing.

crime %>%
  filter(!is.na(location_subtype)) %>%
  group_by(location_subtype) %>%
  summarise(count = n_distinct(crime_id), proportion = count/2531) %>%
  arrange(desc(count))
## # A tibble: 19 × 3
##    location_subtype                             count proportion
##    <chr>                                        <int>      <dbl>
##  1 ROAD                                          1685   0.666   
##  2 NIGHTCLUBS                                     314   0.124   
##  3 THEATRES AND CONCERT HALLS                     158   0.0624  
##  4 FURTHER EDUCATION ESTABLISHMENTS               104   0.0411  
##  5 PARKING                                         87   0.0344  
##  6 DEPARTMENT STORES                               54   0.0213  
##  7 CONFERENCE AND EXHIBITION CENTRES               24   0.00948 
##  8 HIGHER EDUCATION ESTABLISHMENTS                 22   0.00869 
##  9 SPORTS GROUNDS, STADIA AND PITCHES              20   0.00790 
## 10 HOSPITALS                                       11   0.00435 
## 11 THEME AND ADVENTURE PARKS                       11   0.00435 
## 12 LAKES AND WATERS                                 9   0.00356 
## 13 MARKETS                                          8   0.00316 
## 14 POLICE STATIONS                                  8   0.00316 
## 15 SUPERMARKET CHAINS                               8   0.00316 
## 16 ACCIDENT AND EMERGENCY HOSPITALS                 3   0.00119 
## 17 RACECOURSES AND GREYHOUND TRACKS                 3   0.00119 
## 18 BUS AND COACH STATIONS, DEPOTS AND COMPANIES     1   0.000395
## 19 FERRIES AND FERRY TERMINALS                      1   0.000395
crime %>%
  filter(!is.na(location_subtype)) %>%
  group_by(location_subtype, date_of_outcome) %>%
  summarise(count = n()) %>%
  filter(count > 19) %>%
  arrange(desc(count)) %>%
  mutate(location_subtype = factor(location_subtype)) %>%
  ggplot(aes(x = reorder(location_subtype, (-count)), y = count, fill = location_subtype)) +
  geom_bar(position = "stack", stat = "identity") +
  guides(fill = guide_legend(title = "Location of Crime")) +
  facet_grid2(~date_of_outcome, scales = "free_y", independent = "y") +
  labs(title = "Crimes Committed by Location Type in London in October and November",
       subtitle = "There is a similar distribution of crimes per location type over time, with the majority occurring on roads") +
  ylab("Number of Crimes") +
  scale_fill_viridis_d(option = "plasma") +
  theme_clean() +
  theme(plot.title = element_text(size = 12, family = "serif", face = "bold"), 
        plot.subtitle = element_text(size = 10, family = "serif"),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank())

Crime by Location

The fact that the majority of our crimes were committed within a 1-mile radius of Leicester square diminishes the analytical utility of a lon/lat geospatial analysis with R due to an inability to display granularity to the degree required. However, it is possible to identify hot spots for criminal activity by the type of location in our localised area. This helps address the research question “What types of places are most crimes committed in London?” Only crimes that received outcomes have this value.

The first thing to note here is that the proportional distributions for both October and November are similar. Despite differing count values, the shape of the graph remains the same. 1685 of the 2531 (66.6%) crimes committed with this information available were conducted on the road. This was by far the most popular place for crime to be committed in London. The next most common place for crime was in nightclubs, with 314 (12.4%) of crime being committed there. These findings align with general ‘common sense’ understandings of crime in London - nightclubs can be dangerous and don’t wander around on your own late at night. With additional information, like socio-economic status, age and ethnicity, we might be able to draw some socio-cultural conclusions around preference for crime locations. However, in the absence of such information, we must be sure not to draw extrapolated conclusions from our data. Crime is interesting to a number of people. However, the sociological questions that are of most interest are those such as ‘Why did X commit this crime?,’ ‘What can we do to stop this crime?,’ and ‘Should this be a crime?.’ We are not able to answer any of these questions with the Police API data.

Stop and Searches

We are, however, able to perform a more comprehensive analysis on stop and searches. stop and searches were introduced in 1984 in the United Kingdom under the ‘sus’ law. Stop and search originally meant the police could stop, search, and subsequently arrest a ‘suspicious’ person without warrant, reason or evidence. This was repealed in 1981 due to controversy around racial biases in the implementation of this law in black communities. Now, officers are required to have ‘reasonable suspicion.’ The stop and search protocols are still controversial today due to suspected gendered, aged, and ethnic biases when implemented. Recent research on the subject can be found here:

https://www.libertyhumanrights.org.uk/advice_information/stop-and-search/ , https://www.justiceinspectorates.gov.uk/hmicfrs/wp-content/uploads/disproportionate-use-of-police-powers-spotlight-on-stop-search-and-use-of-force.pdf

We will explore the reasons given for stop and searches, their outcomes, when they occur, and which age, genders and ethnicities they affect the most. This analysis will look at stop and searches conducted by the Metropolitan Police Force in London during October 2022. There were 14359 stop and searches that occurred during this time.

# Count of object of search
sas_met %>%
  filter(!is.na(object_of_search)) %>%
  group_by(object_of_search) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
## # A tibble: 8 × 2
##   object_of_search                    count
##   <chr>                               <int>
## 1 Controlled drugs                     8694
## 2 Offensive weapons                    2330
## 3 Stolen goods                         1989
## 4 Evidence of offences under the Act    525
## 5 Fireworks                             435
## 6 Articles for use in criminal damage   188
## 7 Firearms                              112
## 8 Anything to threaten or harm anyone    55
# Count on object of search and outcome
sas_met %>%
  filter(!is.na(object_of_search)) %>%
  group_by(object_of_search, outcome) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
## # A tibble: 43 × 3
## # Groups:   object_of_search [8]
##    object_of_search                   outcome                      count
##    <chr>                              <chr>                        <int>
##  1 Controlled drugs                   A no further action disposal  6007
##  2 Offensive weapons                  A no further action disposal  1792
##  3 Stolen goods                       A no further action disposal  1366
##  4 Controlled drugs                   Community resolution          1171
##  5 Controlled drugs                   Arrest                         962
##  6 Stolen goods                       Arrest                         459
##  7 Offensive weapons                  Arrest                         409
##  8 Fireworks                          A no further action disposal   396
##  9 Evidence of offences under the Act A no further action disposal   358
## 10 Controlled drugs                   Penalty Notice for Disorder    339
## # … with 33 more rows
sas_met %>%
  filter(!is.na(object_of_search)) %>%
  group_by(object_of_search, type) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = reorder(object_of_search, (-count)), y = count, fill = object_of_search)) +
  geom_bar(position = "stack", stat = "identity") +
  guides(fill = guide_legend(title = "Object of Search")) +
  facet_grid2(~type, scales = "free_y", independent = "y") +
  labs(title = "Stop and Searches by Object of Search and Type",
       subtitle = "Drugs are by far the most searched for item, irrespective of whether the suspect is in a vehicle") +
  ylab("Number of SaS") +
  scale_fill_viridis_d(option = "plasma") +
  theme_clean() +
  theme(plot.title = element_text(size = 12, family = "serif", face = "bold"), 
        plot.subtitle = element_text(size = 10, family = "serif"),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        legend.position = "right")

Reasons for Stop and Searches

The object most searched for during stop and searches is drugs. This accounted for 8694 of 14359 (60.5%) of the total stop and searches in October 2022. This is irrespective of whether a person, vehicle, or person and vehicle are being stopped and searched. This is interesting when compared to our crime statistics above, where only 3.7% of crimes committed were associated with drugs - and almost all of those cases required no further action. There seems to be a disconnect, then, between what officers feel warrants reasonable suspicion for a stop and search and what suspects are actually arrested and convicted for. After drugs, most stop and searches occur in the search for weaponry, 2330 (16.2%), and stolen goods, 1989 (13.9%). In the case of drugs, only 962 of 8694 (11.1%) stop and searches result in arrest. The proportion of stop and searches that ends in arrest is much greater for those searching for weapons and stolen goods. For weapons, the arrest rate is 17.6%. For stolen goods, the arrest rate is 23.1%. This disparity in arrest rates, when coupled with the drug-based crime statistics, is evidence that a disproportionate amount of stop and searches are conducted looking for ‘drugs’ where no offence has occurred. Sociologically, this has a number of potential explanations. We do not have the means at our disposal to identify causal relationships, but sociologists have theorised that police officers use ‘drugs’ as a scapegoat for simply enabling the stopping and searching of individuals. Others have argued that stopping and searching for drugs, especially in London, is simply a means of boosting workload statistics. Are there any temporal patterns in when stop and searches are being conducted?

sas_met %>%
  filter(!is.na(object_of_search)) %>%
  ggplot() +
  geom_bar(aes(x = datetime %>% strftime(format = "%j"), fill = object_of_search), stat = "count") +
  geom_hline(yintercept = 448.7188) +
  geom_text(aes(x= 4, y = 449,label = "Daily\nAverage"), size = 2.5, vjust = -0.4,  col = "black") +
  scale_fill_viridis_d(option = "plasma") +
  guides(fill = guide_legend(title = "Object of Search")) +
  labs(title = "Stop and Searches by Week of the Year in October",
       subtitle = "No significant temporal changes week-on-week for stop and searches") +
  ylab("Number of SaS") +
  scale_fill_viridis_d(option = "plasma") +
  theme_clean() +
  theme(plot.title = element_text(size = 12, family = "serif", face = "bold"), 
        plot.subtitle = element_text(size = 10, family = "serif"),
        axis.ticks.x = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        legend.position = "right")

Time & Stop and Searches.

There is natural deviation in the number of stop and searches conducted as the week commences, but the levels fluctuate around the average number per day, which is shown with the black horizontal line. There are ‘breakouts,’ like in the last day of the above graphic, but generally the levels revert to the mean. The most interesting thing to derive from this graphic is that every 6/7 days there is a relative spike compared to the prior 5/6 days. This is the weekend, and is the time period during the week where most stop and searches are conducted. This is interesting, but expected. Moving away from temporality due to a lack of challenging findings, what about stop and searches generally, what are the general outcomes? Are they worth it?

# Number of stop and searches per Outcome
sas_met %>%
  group_by(outcome) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  arrange(desc(count))
## # A tibble: 6 × 3
##   outcome                         count proportion
##   <chr>                           <int>      <dbl>
## 1 A no further action disposal    10177    0.709  
## 2 Arrest                           2075    0.145  
## 3 Community resolution             1393    0.0970 
## 4 Penalty Notice for Disorder       439    0.0306 
## 5 Summons / charged by post         248    0.0173 
## 6 Caution (simple or conditional)    27    0.00188
# Object of search for those ending in arrest.

sas_met %>%
  filter(outcome == "Arrest") %>%
  group_by(object_of_search) %>%
  summarise(count = n(), proportion = count/2075) %>%
  arrange(desc(count))
## # A tibble: 9 × 3
##   object_of_search                    count proportion
##   <chr>                               <int>      <dbl>
## 1 Controlled drugs                      962    0.464  
## 2 Stolen goods                          459    0.221  
## 3 Offensive weapons                     409    0.197  
## 4 Evidence of offences under the Act    131    0.0631 
## 5 Articles for use in criminal damage    60    0.0289 
## 6 Firearms                               27    0.0130 
## 7 Fireworks                              16    0.00771
## 8 <NA>                                    6    0.00289
## 9 Anything to threaten or harm anyone     5    0.00241

Outcomes of Stop and Searches

70.9% of all stop and searches by the Metropolitan Police Force in October 2022 resulted in no further action. This enormous number begs the question of what constitutes ‘reasonable suspicion of an offence’ that is required to perform a stop and search. If 70.9% are resulting in no further action, could that resource not be spent elsewhere? Or, oppositionally, is there a deterrent effect from stop and searches that the police force believe to limit the number and scope of crimes? This is an interesting avenue for further inquiry.

2075 (14.5%) resulted in arrest. Most arrests as a result of stop and searches are a result of drug possession. However, it is difficult to disentangle whether this is because there are a disproportionately large number of stop and searches that are looking for drugs, or whether drug possession is indeed the most populous crime. Cross-referencing with drug crime data found earlier in this analysis would suggest the former is the case, as a low proportion of overall crime can be attributed to drugs. Stolen goods and weapon searches are the object reason for the next most populous arrested groups.

This illustration becomes increasingly more interesting as we start to intersect these numbers with Age, Gender and Ethnicity - both in terms of the number of stops and searches conducted, but also for the outcomes of said incidences.

# Number of SaS by Gender
sas_met %>%
  group_by(gender) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  arrange(desc(count)) 
## # A tibble: 4 × 3
##   gender count proportion
##   <chr>  <int>      <dbl>
## 1 Male   13007    0.906  
## 2 Female  1170    0.0815 
## 3 <NA>     165    0.0115 
## 4 Other     17    0.00118
# 0.905843025 male.

# Number of SaS by Age and Gender
sas_met %>%
  group_by(age_range, gender) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  filter(gender %in% c("Female","Male") & !is.na(age_range) & age_range != "under 10") %>%
  arrange(desc(count)) %>%
  ggplot +
  geom_bar(aes(x = age_range, y = count,  fill = gender), stat = "identity") +
  guides(fill = guide_legend(title = "Gender")) +
  labs(title = "Stop and Searches by Age and Gender",
       subtitle = "18-24 the most common group to stopped and searched, with similarity in levels for other age groups") +
  ylab("Number of SaS") +
  xlab("Age Group") +
  scale_fill_manual(values=c("red", "lightblue")) +
  theme_few() +
  theme(plot.title = element_text(size = 12, family = "serif", face = "bold"), 
        plot.subtitle = element_text(size = 10, family = "serif"),
        legend.position = "right")

# Number of SaS by Age
sas_met %>%
  group_by(age_range) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  arrange(desc(count))
## # A tibble: 6 × 3
##   age_range count proportion
##   <chr>     <int>      <dbl>
## 1 18-24      3964   0.276   
## 2 25-34      2880   0.201   
## 3 over 34    2875   0.200   
## 4 10-17      2545   0.177   
## 5 <NA>       2093   0.146   
## 6 under 10      2   0.000139
# Number of SaS by Ethnicity
sas_met %>%
  group_by(officer_defined_ethnicity) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  arrange(desc(count))
## # A tibble: 5 × 3
##   officer_defined_ethnicity count proportion
##   <chr>                     <int>      <dbl>
## 1 White                      5721     0.398 
## 2 Black                      5341     0.372 
## 3 Asian                      2127     0.148 
## 4 Other                       827     0.0576
## 5 <NA>                        343     0.0239
# White people the most stopped and searched but disproportionately low compared to % of population
# Black people stopped almost as much as white people despite constituting only around 10-15% of population


# Number of SaS by Ethnicity and age
# Statistics for Asian and Black London population proportions came from: https://en.wikipedia.org/wiki/Ethnic_groups_in_London#Racial_breakdown_of_London
# according to the 2021 census. 
sas_met %>%
  group_by(officer_defined_ethnicity, age_range) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  filter(!is.na(officer_defined_ethnicity) & !is.na(age_range) & age_range != "under 10") %>%
  arrange(desc(count)) %>%
  ggplot +
  geom_bar(aes(x = reorder(officer_defined_ethnicity, count), y = proportion,  fill = age_range), stat = "identity") +
  geom_hline(yintercept = 0.1332) +
  geom_text(aes(x = 1, y = 0.135, label= "Black London %"), size = 2.5, vjust = -0.2, col = "black") +
  geom_hline(yintercept = 0.2066) +
  geom_text(aes(x = 2, y = 0.21, label= "Asian London %"), size = 2.5, vjust = -0.2, col = "black") +
  guides(fill = guide_legend(title = "Age Range")) +
  labs(title = "Stop and Searches by Ethnicity and Age Group in London",
       subtitle = "Black people stopped and searched at a disproportionately high rate compared to White and Asian groups") +
  ylab("Proportion of Total SaS") +
  xlab("Officer Defined Ethnicity") +
  scale_fill_brewer(palette = "Dark2") +
  theme_few() +
  theme(plot.title = element_text(size = 12, family = "serif", face = "bold"), 
        plot.subtitle = element_text(size = 9, family = "serif"),
        legend.position = "right")

## Number of SaS by age as proportion of total ethnicity numbers
sas_met %>%
  group_by(officer_defined_ethnicity, age_range) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  filter(!is.na(officer_defined_ethnicity) & !is.na(age_range) & age_range != "under 10") %>%
  arrange(desc(count)) %>%
  ggplot +
  geom_bar(aes(x = reorder(officer_defined_ethnicity, count), y = proportion,  fill = age_range), position = "fill", stat = "identity") +
  scale_y_continuous(labels = scales::percent_format()) +
  guides(fill = guide_legend(title = "Age Range")) +
  labs(title = "Stop and Searches by Age Group as Proportion of Ethnicity",
       subtitle = "White SaS tends to be focused on older people whereas Black SaS tends to be focused on younger people") +
  ylab("Proportion of Total SaS per Ethnicity Category") +
  xlab("Officer Defined Ethnicity") +
  scale_fill_brewer(palette = "Dark2") +
  theme_few() +
  theme(plot.title = element_text(size = 12, family = "serif", face = "bold"), 
        plot.subtitle = element_text(size = 9, family = "serif"),
        legend.position = "right")

# Ethnicity and Age numbers
sas_met %>%
  filter(gender %in% c("Female","Male") & !is.na(age_range) & !is.na(officer_defined_ethnicity)) %>%
  group_by(age_range,officer_defined_ethnicity) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  arrange(desc(count))
## # A tibble: 17 × 4
## # Groups:   age_range [5]
##    age_range officer_defined_ethnicity count proportion
##    <chr>     <chr>                     <int>      <dbl>
##  1 over 34   White                      1596   0.111   
##  2 18-24     Black                      1531   0.107   
##  3 18-24     White                      1319   0.0919  
##  4 25-34     White                      1317   0.0917  
##  5 10-17     Black                      1205   0.0839  
##  6 10-17     White                       893   0.0622  
##  7 25-34     Black                       860   0.0599  
##  8 over 34   Black                       824   0.0574  
##  9 18-24     Asian                       764   0.0532  
## 10 25-34     Asian                       483   0.0336  
## 11 over 34   Asian                       329   0.0229  
## 12 18-24     Other                       288   0.0201  
## 13 10-17     Asian                       279   0.0194  
## 14 25-34     Other                       186   0.0130  
## 15 10-17     Other                       135   0.00940 
## 16 over 34   Other                        89   0.00620 
## 17 under 10  Black                         2   0.000139
# Gender, Ethnicity & Age numbers
sas_met %>%
  filter(gender %in% c("Female","Male") & !is.na(age_range) & !is.na(officer_defined_ethnicity)) %>%
  group_by(gender, age_range,officer_defined_ethnicity) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  arrange(desc(count))
## # A tibble: 33 × 5
## # Groups:   gender, age_range [9]
##    gender age_range officer_defined_ethnicity count proportion
##    <chr>  <chr>     <chr>                     <int>      <dbl>
##  1 Male   18-24     Black                      1457     0.101 
##  2 Male   over 34   White                      1339     0.0933
##  3 Male   18-24     White                      1169     0.0814
##  4 Male   10-17     Black                      1154     0.0804
##  5 Male   25-34     White                      1151     0.0802
##  6 Male   10-17     White                       821     0.0572
##  7 Male   25-34     Black                       804     0.0560
##  8 Male   18-24     Asian                       737     0.0513
##  9 Male   over 34   Black                       731     0.0509
## 10 Male   25-34     Asian                       461     0.0321
## # … with 23 more rows
# Number of SaS by Gender, Ethnicity and age 
sas_met %>%
  filter(gender %in% c("Female","Male") & !is.na(age_range) & !is.na(officer_defined_ethnicity)) %>%
  group_by(age_range,officer_defined_ethnicity,gender) %>%
  summarise(count = n(), proportion = count/nrow(sas_met)) %>%
  arrange(desc(count)) %>%
  ggplot() +
  geom_point(aes(x = reorder(officer_defined_ethnicity, count), y = proportion, colour = age_range)) +
  guides(colour = guide_legend(title = "Age Range")) +
  ggh4x::facet_grid2(~gender) +
  labs(title = "Proportion of Stop and Searches by Gender, Age, and Ethnicity",
       subtitle = "Differing age groups get stopped at differing rates by gender and ethnicity") +
  xlab("Officer Defined Ethnicity") +
  ylab("Proportion of Total SaS") +
  scale_color_brewer(palette = "Dark2") +
  theme_few() +
  theme(plot.title = element_text(size = 12, family = "serif", face = "bold"), 
        plot.subtitle = element_text(size = 10, family = "serif"),
        legend.position = "right")

Age, Gender and Ethnicity of Stop and Searches

We will use Officer-defined Ethnicity here because it is the perspective of the officer that matters when exploring potential bias in policing practices. This final analysis will seek to understand how ethnicity intersects with age and gender to affect the rates at which different groups are stopped and searched.

Gender

First, it’s important to note that over 90% of all stop and searches in London during October 2022 were conducted on men. This is slightly higher than the proportion of men arrested for crimes in the U.K. in 2022, which sits at around 85% (https://www.statista.com/statistics/377412/arrests-england-and-wales-time-series-by-gender/). This is expected, but an interesting line of future inquiry would be to try to study causality behind this large disparity.

Age

18 to 24 years olds experience the most stop and searches, with 27.6% of all stop and searches in our sample being associated with suspect in that age group. Those between 25-34 and 34+ both account for about 20% of all stop and searches. 17.7% of stop and searches have a minor between the ages of 10 and 17 as the suspect. Interestingly, the distribution of stop and searches over age is surprisingly constant, relative to gender. Young men between the age of 18-24 experience more stop and searches than all other age groups. An analysis focusing purely on age would require more granular information regarding the specific age of the suspect - which is not available via the Police API. A thorough distribution analysis would require this. Otherwise, a researcher would be required to make coding decisions regarding ages derived from assumptions that would have substantive implications on analysis. Age, however, becomes a much more interesting variable when intersected with ethnicity.

Ethnicity

Controversy around stop and search in the U.K. over the last few years has been centred around two themes - protestors, and ethnicity. I will focus on the latter here. White people will be used as the general reference group here due to being the most populous group in terms of the proportion of the population. Black people are arrested at a factor of 1.79 (+179%) times more than is proportional to the population of Black people in London. Black people constitute 13.32% of the London population, but 37.2% of all stop and searches are conducted on Black people. Both Asian and White people experience being stopped and searched at a lesser rate compared to the proportion of the London population that they account for. This is suggestive of a bias in police practices to stop and search Black people at a much greater rate than other ethnic groups, especially when controlling for population. This picture gets even more interesting, however, when ethnicity is intersected with age.

The final two graphics in this analysis illustrate how people of different age, ethnicity and gender get stopped and searched at different rates. The general trend shows that for the White population, older people tend to be stopped and searched more. In contrast, young Black people tend to be stopped and searched to a greater degree than older Black people. When looking at age and ethnicity, the most stopped and searched group (by total number) are White people over the age of 34, with 1596 (11.11%) total occurrences in October 2022. Second are young Black people between the ages of 18-24, with 1531 (10.66%) total occurrences. However, when intersected with gender, the group that is most stopped and searched are Black men between the ages of 18-24, with 1457 (10.14%) occurrences.This is the only combination of gender, ethnicity and age which eclipses 10% of all total stop and searches. This makes sense as the most stopped and searched female group are White women over the age of 34, which splinters the previous most populous group (White 34+) when only looking at Ethnicity and age.

Thus, young black men experience the most stop and searches, with older white men being the second most stopped group. However, given the population imbalance between White and Black people in London, the risk of being stopped and searched, in terms of likelihood, is significantly higher for young black men between the ages of 18 and 24 compared to any other group.

The interaction between neighbourhoods, wealth and opportunity inequality, and policing practices is a salient research subject in the U.K. today. Further neighbourhood-based research where one controls for relative neighbourhood wealth disparities would add a great deal to this analysis. Unfortunately, such information is not available via the Police API and so I will pursue this project further in the future.

Conclusion

Crime in London is varied by type. Interestingly, only a small portion of crimes charged in London are associated with drugs. This is interesting considering 60.5% of all stop and searches in October 2022 were looking for drugs. Most of the crimes that were charged in October that were resolved in October and November resulted in no further action. This may be a product of the short time-to-outcome that biases towards more menial offences, but is an interesting finding nonetheless. We would benefit from a more longitudinal time-series study of charge outcomes to deal with this potential bias. Most crime in London is, perhaps unsurprisingly, committed on the streets.

As well as drugs, stop and searches are often also searching for stolen goods or weapons. These stop and searches have day-to-day temporal stability, with spikes at the weekends but the concept of mean reversion generally holds. Most stop and searches result in no further action, with arrests being the second most common outcome. The sheer numbers of stop and searches for drugs, however, when combined with the fact that few drug crimes are committed relative to others, suggests a stop and search practice bias towards the hunting of drugs.

Men are 9 times more likely to be stopped and searched than women. This is especially true for Black men. Young Black men and older White men are the most likely groups to be stopped and searched. When accounting for the population proportions, however, young Black men are at far greater risk of being stopped and searched. This is indicative of bias towards the stopping and searching of young Black men. A further line on inquiry is an expansion of this age, ethnicity and gender analysis whilst controlling for neighbourhood wealth and the gender, ethnicity and age proportions in neighbourhoods in London.